Spam Classification
The base form of a word is known as the lemma. Stemming might yield only s when confronted with the token saw,
whereas lemmatization might try to return either see or saw, depending on whether the
token was used as a verb or a noun.
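As a concrete illustration (not part of the original notes), here is a minimal sketch using NLTK's PorterStemmer and WordNetLemmatizer; exact outputs depend on the stemmer and WordNet version, but the lemmatizer's part-of-speech awareness is the key difference.

```python
# A minimal sketch: stemming vs. lemmatization with NLTK.
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)   # WordNet corpus used by the lemmatizer
nltk.download("omw-1.4", quiet=True)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for token in ["saw", "cats", "ate"]:
    print(
        token,
        "| stem:", stemmer.stem(token),                           # rule-based suffix stripping, no POS awareness
        "| lemma as verb:", lemmatizer.lemmatize(token, pos="v"),  # 'saw' -> 'see', 'ate' -> 'eat'
        "| lemma as noun:", lemmatizer.lemmatize(token, pos="n"),  # 'saw' stays 'saw', 'cats' -> 'cat'
    )
```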
Remember how the data (textual data) looked - it was just a collection of characters that
machines couldn't understand. Starting with this information, you'll take the following
steps:
Lexical Processing:
To begin, you'll simply convert the raw text into words and then, depending on your
application's requirements, into sentences or paragraphs. If an email contains words like
lottery, reward, or luck, it's probably spam.
As a result, the collection of words in a sentence gives us a pretty decent idea of what
the sentence means in general. To make this group more reflective of the sentence, several
extra processing steps are frequently performed; for example, cat and cats are regarded as
the same word. In general, all plural words are treated as equivalent to their singular
counterparts.
Lexical processing is sufficient for simple applications such as spam detection, but it is
frequently insufficient for more complex applications such as machine translation. The sentences
"My cat ate its third meal" and "My third cat ate its meal," for example, have completely
different meanings. However, because the "group of words" in both sentences is the same,
lexical processing will perceive them as equal. As a result, we clearly require a more powerful
analysis system.
Syntactic Processing:
After lexical analysis, the next stage is to try to extract more meaning from the sentence, this
time utilising its syntax. To grasp what the meaning is, we look at the syntactic structures, or
the grammar of the language, rather than just the words.
Differentiating between the subject and the object of the sentence, i.e. determining who
performs the action and who is affected by it, is one example. "Ram thanked Shyam" and
"Shyam thanked Ram" are two statements with different meanings because in the first, Ram
performs the act of "thanking" and affects Shyam, but in the second, Shyam performs the act
of "thanking" and affects Ram. As a result, a syntactic analysis based on the subjects and
objects of a sentence will be able to make this distinction.
There are a number of other ways in which these syntactic analyses can aid our
comprehension. For example, if a question answering system understands that the phrases
"Prime Minister" and "India" are connected, it will perform substantially better when asked
"Who is the Prime Minister of India?"
Semantic Processing:
When it comes to advanced NLP applications like language translation and chatbots, lexical
and syntactic processing aren't enough. After completing the two stages outlined above, the
machine will still be unable to comprehend the meaning of the text. A question-answering
system, for example, may be unable to recognise that PM and Prime Minister mean the same
thing. As a result, if someone asks it, "Who is the Prime Minister
of India?" it may not be able to respond unless it has a separate database for PMs, because it
won't grasp that the terms PM and Prime Minister are interchangeable. You could record the
answer for both forms of the meaning (PM and Prime Minister) individually, but how many of
these meanings are you going to manually store?
This is commonly accomplished by inferring the meaning of a word from a group of words
that frequently occur around it. So, if the words PM and Prime Minister are commonly used
in conjunction with comparable nouns, you can presume that their meanings are also similar.
In fact, the machine should be able to understand additional semantic relations this way as
well. It should, for example, be able to recognise that the terms "King" and "Queen" are
related, and that "Queen" is simply the female counterpart of the word "King." Additionally,
both of these nouns can be grouped together under the term "monarch." You could certainly
store these relationships manually, but it will be much more helpful if you can teach your
computer to discover and understand these relationships on its own. We'll discuss how that
training can be carried out in more detail later.
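As a brief preview of how such training can work (this is not from the notes; gensim's Word2Vec is just one common choice), here is a minimal sketch on a made-up toy corpus. With a real corpus of millions of sentences, words that occur in similar contexts, such as "king" and "queen", end up with similar vectors.

```python
# A minimal, illustrative sketch (toy corpus, assumed library choice):
# learning word meaning from surrounding context with gensim's Word2Vec.
from gensim.models import Word2Vec

# tiny made-up corpus; real applications train on millions of tokenised sentences
sentences = [
    ["the", "prime", "minister", "addressed", "the", "parliament"],
    ["the", "pm", "addressed", "the", "parliament"],
    ["the", "king", "ruled", "the", "country"],
    ["the", "queen", "ruled", "the", "country"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=200)

# words are compared by the similarity of their learned vectors
print(model.wv.most_similar("king", topn=3))
```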
You can use the meaning of the words you've obtained through semantic analysis for a variety
of purposes. Machine translation, chatbots, and a variety of other applications necessitate a
thorough knowledge of the text, from lexical to syntactic to semantic levels. As a result, lexical,
syntactic, and semantic processing are merely the "pre-processing" layer of the total process in most of
these applications. In certain basic applications, lexical processing alone may suffice as the
pre-processing step.
Word Frequency
Characters, words, phrases, and paragraphs now make up a text. Looking at the word
frequency distribution, or visualising the word frequencies of a particular text corpus, is the
most fundamental statistical study you can undertake.
When you plot word frequencies in a big corpus of text, such as a corpus of news articles, user
reviews, Wikipedia articles, and so on, you'll see a consistent trend. Professor Srinath will
present some fascinating insights from word frequency distributions in the next lecture. You'll
also discover what stopwords are and why they're less useful than other words.
To summarise, Zipf's law (formulated by the linguist-statistician George Zipf) states that a
word's frequency is inversely proportional to its rank, with rank 1 denoting the most frequent term,
rank 2 the second most frequent, and so on. This is an example of a power-law
distribution.
The basic concept of stopwords comes from Zipf's law: they are the words with the
highest frequencies (i.e., the lowest ranks) in the text, and they are often of minimal 'importance.'
Broadly, words fall into three groups:
1. Stop words, such as 'is,' 'an,' and 'the,' which are extremely common.
2. Significant words, which are usually more crucial to comprehending the text.
3. Rarely occurring words, which appear so infrequently that they are also less relevant than the significant words.
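As an illustration of the rank-frequency pattern and of stopword filtering (the tiny corpus below is made up), here is a minimal sketch using NLTK's English stopword list:

```python
# A small sketch: word frequencies, rank order, and stopword removal.
import nltk
from collections import Counter
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)

text = "the prize is a lottery prize and the lottery is a reward for the lucky winner"
tokens = text.lower().split()

# rank words by frequency; the top ranks are dominated by words like 'the', 'is', 'a'
freq = Counter(tokens)
for rank, (word, count) in enumerate(freq.most_common(5), start=1):
    print(rank, word, count)

# drop stopwords to keep the words that actually carry the spam signal
stop_words = set(stopwords.words("english"))
significant = [t for t in tokens if t not in stop_words]
print(significant)   # ['prize', 'lottery', 'prize', 'lottery', 'reward', 'lucky', 'winner']
```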
Tokenisation
Another consideration is how to extract features from the messages so that they may be
utilised to construct a classifier. When building a machine learning model, such as a spam
detector, you must feed in features from each message that the machine learning algorithm
may use to develop the model. However, there are only two columns in the spam dataset:
one contains the message and the other provides the label associated with the message.
Machine learning, as you may know, works with numeric data rather than words. Previously,
you either regarded text columns as categorical variables and turned each categorical variable
to a numeric variable by assigning numeric values to each category, or you created dummy
variables.
Because the message column is free text rather than a categorical variable, you can't do either of these.
Your model will fail badly if you treat it as a categorical variable (you can try this out as an exercise).
You will extract features from the messages to solve this problem. You'll extract each word
from each message by splitting it down into individual words, or 'tokens.'
Tokenisation is a text splitting technique that divides the text into smaller components.
Depending on the application, these elements can be characters, phrases, sentences, or even
chapters.
In the case of the spam detector, you'll break each message down into individual words, which
is referred to as word tokenisation. Other tokenisation techniques exist as well, such
as character tokenisation, sentence tokenisation, and so on. Different contexts require different
methods of tokenisation.
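For instance, here is a minimal sketch of word and sentence tokenisation with NLTK (the message is made up):

```python
# A minimal sketch of word and sentence tokenisation with NLTK.
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt", quiet=True)   # newer NLTK versions may also need 'punkt_tab'

message = "Congratulations! You have won a lottery reward. Claim your luck now."

print(sent_tokenize(message))   # sentence tokens
print(word_tokenize(message))   # word tokens, e.g. ['Congratulations', '!', 'You', 'have', 'won', ...]
```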
Bag-of-Words Model
The bag-of-words model seeks to deduce a document's meaning solely from its content and assumes that
documents with similar content are similar to one another. NLP algorithms cannot be fed text directly;
they work with numbers. The model therefore transforms the text into a collection of words and keeps
track of how many times each word appears, converting each text into a fixed-length vector of word counts.
The data should first be pre-processed: the text is transformed to lower case and all non-word
characters and punctuation are removed.
The next step is to find the most frequently occurring words in the text. The vocabulary is
established, each sentence is tokenised into words, and the number of times each word appears is
counted. Following that, the model is built: for each document, a vector over the vocabulary is
created, with an entry set to 1 (or to the word's count) if the corresponding word appears in the
document and to 0 if it does not. And now you have your result.
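A minimal sketch, with made-up messages, of these steps (lower-casing, stripping punctuation, building the vocabulary, and producing fixed-length count vectors) using only the Python standard library:

```python
# A minimal sketch of the bag-of-words steps described above.
import re
from collections import Counter

messages = [
    "Win a FREE lottery reward now!!!",
    "Are we still meeting for lunch today?",
]

def preprocess(text):
    # lower-case and drop everything that is not a word character or whitespace
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return text.split()

tokenised = [preprocess(m) for m in messages]

# build the vocabulary from all messages
vocabulary = sorted({word for tokens in tokenised for word in tokens})

# represent each message as a fixed-length count vector over the vocabulary
vectors = []
for tokens in tokenised:
    counts = Counter(tokens)
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
print(vectors)   # one row per message, one column per vocabulary word
```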
The bag-of-words model's most significant advantage is its simplicity and ease of use. It can be
used to build a rough first model before moving on to more sophisticated word embeddings.
Although the bag-of-words paradigm is simple to grasp and apply, it does have significant limitations
and downsides. The vocabulary/dictionary needs to be designed very carefully. Its size has an impact
on the sparsity of the document representations and must be managed well.
The model ignores context by discarding the meaning of the words and focusing on frequency of
occurrence. This can be a major problem, because the arrangement of the words in a sentence can
completely change the meaning of the sentence and the model cannot account for this.
Another major drawback of this model is that it is rather difficult to model sparse representations.
This is due to informational reasons as well as computational reasons. The model finds it difficult to
harness a small amount of information in a vast representational space.
Naïve Bayes Algorithm
It's a classification method based on Bayes' Theorem and the assumption of predictor independence.
A Naive Bayes classifier, in simple terms, posits that the existence of one feature in a class is unrelated
to the presence of any other feature.
For example, if a fruit is red, round, and roughly 3 inches in diameter, it is termed an apple. Even if
these characteristics are reliant on one another or on the presence of other characteristics, they all
add to the likelihood that this fruit is an apple, which is why it is called 'Naive.' The Naive Bayes model
is simple to construct and is especially useful for very large data sets. Owing to its simplicity, Naive
Bayes is known to outperform even highly sophisticated classification methods. Bayes' theorem allows you to
calculate the posterior probability P(c|x) from P(c), P(x), and P(x|c):
P(c|x) = P(x|c) * P(c) / P(x)
Pros:
Predicting the class of a test data set is simple and quick. Naive Bayes is also good at
multi-class prediction.
When the assumption of independence holds, a Naive Bayes classifier outperforms other
models such as logistic regression, and it requires less training data.
It performs well with categorical input variables compared to numerical ones. For numerical
variables, a normal distribution (a bell curve, which is a strong assumption) is assumed.
Cons:
If a categorical variable in the test data set has a category that was not present in the
training data set, the model will assign it a probability of 0 (zero) and will be unable to
generate a prediction. This is commonly referred to as "zero frequency." We can use a
smoothing technique to remedy this; Laplace estimation is one of the most basic smoothing
techniques.
Naive Bayes is also known to be a poor probability estimator, so the probability outputs from
predict_proba should be treated with caution.
The assumption of independent predictors is another flaw in Naive Bayes. In real life,
obtaining a set of predictors that are completely independent is nearly impossible.
Real-time prediction: Naive Bayes is a fast, eager-learning classifier, so it can be used to
make real-time predictions.
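The notes build the final classifier with NLTK (see the Conclusion); as a rough illustration only, here is a minimal sketch with NLTK's NaiveBayesClassifier on a tiny made-up training set.

```python
# A minimal, illustrative sketch of a Naive Bayes spam classifier with NLTK.
# The tiny training set below is made up; a real spam dataset has thousands
# of labelled messages.
from nltk.classify import NaiveBayesClassifier

def features(message):
    # bag-of-words style features: which words are present in the message
    return {word: True for word in message.lower().split()}

train = [
    (features("win a free lottery reward now"), "spam"),
    (features("claim your luck and win a prize"), "spam"),
    (features("are we still meeting for lunch today"), "ham"),
    (features("please review the attached project report"), "ham"),
]

# NLTK smooths its probability estimates by default, which helps with the
# 'zero frequency' problem discussed above.
classifier = NaiveBayesClassifier.train(train)

print(classifier.classify(features("free lottery prize")))   # expected: 'spam'
classifier.show_most_informative_features(5)                 # which words drive the decision
```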
TF-IDF Representation
The TF-IDF (term frequency-inverse document frequency) statistic examines the relevance of a word
to a document in a collection of documents.
This is accomplished by multiplying two metrics: the number of times a word appears in a document
and the word's inverse document frequency over a collection of documents.
It has a variety of applications, including automatic text analysis and scoring words in machine
learning techniques for Natural Language Processing (NLP).
The TF-IDF measure was created for document search and retrieval. It works by growing in proportion
to the number of times a word appears in a document, but offset by the number of documents
containing the word. As a result, words like this, what, and if, which appear frequently in all
documents, rank low since they don't mean much to that document in particular.
However, if the word Bug appears frequently in one document but not in others, it is likely to be very
relevant. If we're trying to figure out which themes some NPS replies belong to, the term Bug, for
example, will almost certainly be associated with the topic Reliability, because most responses
including that word will be about that topic.
TF-IDF for a word in a document is calculated by multiplying two different metrics:
1. Term frequency: the number of times the word appears in the document. The simplest way to
calculate this is to simply count how often the word occurs in the document. The count can then
be adjusted for the length of the document or by the raw frequency of the most frequent word in
the document.
2. Inverse document frequency: how common or rare the word is across the entire document set.
It is calculated by taking the total number of documents, dividing it by the number of documents
that contain the word, and taking the logarithm of that ratio. If a word is very common and
appears in many documents, this value will be close to zero; the rarer the word, the larger the
value.
To put it another way, the TF-IDF score for the word t in document d from document
set D is, in its common form, determined as follows:
tf-idf(t, d, D) = tf(t, d) * idf(t, D), where idf(t, D) = log(N / df(t)), N is the total number of
documents in D, and df(t) is the number of documents in D that contain t.
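A rough sketch of this calculation (using the plain tf * log(N/df) form above; libraries such as scikit-learn add smoothing, so exact values will differ):

```python
# A minimal sketch of the TF-IDF calculation described above.
import math
from collections import Counter

documents = [
    "the app crashed with a bug again",
    "great app works fine",
    "another bug report for the app",
]

tokenised = [doc.lower().split() for doc in documents]
N = len(tokenised)

# document frequency: in how many documents each term appears
df = Counter()
for tokens in tokenised:
    df.update(set(tokens))

def tf_idf(term, tokens):
    tf = tokens.count(term) / len(tokens)   # term frequency in this document
    idf = math.log(N / df[term])            # inverse document frequency
    return tf * idf

print(tf_idf("bug", tokenised[0]))   # distinctive word -> non-zero score
print(tf_idf("app", tokenised[0]))   # appears in every document -> idf = log(1) = 0
```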
Applications of TF-IDF
Information Retrieval
TF-IDF was created for document search and can be utilised to offer the most relevant results for
what you're looking for. Assume you have a search engine, and someone is looking for LeBron
James. The results will be presented in the order of their importance. That is, because TF-IDF gives
the term LeBron a higher score, the most relevant sports articles will be ranked higher.
TF-IDF scores are almost certainly used in the algorithms of any search engine you've ever used.
Keyword Extraction
The TF-IDF can also be used to extract keywords from text. How? The words in a document with the
highest scores are the most relevant to that document, and hence can be deemed keywords for that
document. It's all really simple.
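A small sketch of this idea with scikit-learn's TfidfVectorizer (the feedback responses below are made up); the highest-scoring terms in each document can be taken as its keywords:

```python
# A minimal sketch of keyword extraction with TF-IDF scores.
from sklearn.feature_extraction.text import TfidfVectorizer

responses = [
    "the app keeps crashing, there is a bug in the latest release",
    "love the new design, the interface looks great",
    "another bug after the update, very unreliable",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(responses)
terms = vectorizer.get_feature_names_out()

# the three highest-scoring words in the first response are its keywords
scores = tfidf[0].toarray().ravel()
top = scores.argsort()[::-1][:3]
print([terms[i] for i in top])
```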
Conclusion:-
Finally, we went through the process of building the spam detector utilising all of the
preprocessing procedures that you had previously learned. To create the spam classifier, we
used the NLTK library rather than the scikit-learn library.
This could help companies engage with their customers in the best possible way and would
help ensure that content creators do not produce content that ends up in the spam folder.