Spam Classification
The base form of a word is known as the lemma. Stemming might yield only s when confronted with the token saw,
whereas lemmatization might try to return either see or saw, depending on whether the
token was used as a verb or a noun.
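As a concrete illustration (not part of the original notes), here is a minimal sketch using NLTK's PorterStemmer and WordNetLemmatizer; exact outputs depend on the stemmer and WordNet version, but the lemmatizer's part-of-speech awareness is the key difference.

```python
# A minimal sketch: stemming vs. lemmatization with NLTK.
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)   # WordNet corpus used by the lemmatizer
nltk.download("omw-1.4", quiet=True)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for token in ["saw", "cats", "ate"]:
    print(
        token,
        "| stem:", stemmer.stem(token),                           # rule-based suffix stripping, no POS awareness
        "| lemma as verb:", lemmatizer.lemmatize(token, pos="v"),  # 'saw' -> 'see', 'ate' -> 'eat'
        "| lemma as noun:", lemmatizer.lemmatize(token, pos="n"),  # 'saw' stays 'saw', 'cats' -> 'cat'
    )
```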
Remember how the data (textual data) looked - it was just a collection of characters that
machines couldn't understand. Starting with this information, you'll take the following
steps:
Lexical Processing:
To begin, you'll simply convert the raw text into words and then, depending on your
application's requirements, into sentences or paragraphs. If an email contains words like
lottery, reward, or luck, it's probably spam.
As a result, the collection of words in a sentence gives us a pretty decent idea of what
the sentence means in general. To make this group more reflective of the sentence, several
extra processing steps are frequently performed; for example, cat and cats are regarded as
the same word. In general, all plural words are treated as equivalent to their singular
counterparts.
Lexical processing is sufficient for simple applications such as spam detection, but it is
frequently insufficient for more complex applications such as machine translation. The sentences
"My cat ate its third meal" and "My third cat ate its meal," for example, have completely
different meanings. However, because the "group of words" in both sentences is the same,
lexical processing will perceive them as equal. As a result, we clearly require a more powerful
analysis system.
Syntactic Processing:
After lexical analysis, the next stage is to try to extract more meaning from the sentence, this
time utilising its syntax. To grasp what the meaning is, we look at the syntactic structures, or
the grammar of the language, rather than just the words.
Differentiating between the subject and the object of the sentence, i.e. determining who
performs the action and who is affected by it, is one example. "Ram thanked Shyam" and
"Shyam thanked Ram" are two statements with different meanings because in the first, Ram
performs the act of "thanking" and affects Shyam, but in the second, Shyam performs the act
of "thanking" and affects Ram. As a result, a syntactic analysis based on the subjects and
objects of a sentence will be able to make this distinction.
There are a number of other ways in which these syntactic analyses can aid our
comprehension. For example, if a question answering system understands that the phrases
"Prime Minister" and "India" are connected, it will perform substantially better when asked
"Who is the Prime Minister of India?"
Semantic Processing:
When it comes to advanced NLP applications like language translation and chatbots, lexical
and syntactic processing aren't enough. After completing the two stages outlined above, the
machine will still be unable to comprehend the meaning of the text. A question-answering
system, for example, may be unable to recognise that PM and Prime Minister mean the same
thing. As a result, if someone asks it, "Who is the Prime Minister
of India?" it may not be able to respond unless it has a separate database for PMs, because it
won't grasp that the terms PM and Prime Minister are interchangeable. You could record the
answer for both forms of the meaning (PM and Prime Minister) individually, but how many of
these meanings are you going to manually store?
This is commonly accomplished by inferring the meaning of a word from a group of words
that frequently occur around it. So, if the words PM and Prime Minister are commonly used
in conjunction with comparable nouns, you can presume that their meanings are also similar.
In fact, the machine should be able to understand additional semantic relations this way as
well. It should, for example, be able to recognise that the terms "King" and "Queen" are
related, and that "Queen" is simply the female counterpart of the word "King." Additionally,
both of these nouns can be grouped together under the term "monarch." You could certainly
store these relationships manually, but it will be much more helpful if you can teach your
computer to discover and understand these relationships on its own. We'll discuss how that
training can be carried out in more detail later.
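As a brief preview of how such training can work (this is not from the notes; gensim's Word2Vec is just one common choice), here is a minimal sketch on a made-up toy corpus. With a real corpus of millions of sentences, words that occur in similar contexts, such as "king" and "queen", end up with similar vectors.

```python
# A minimal, illustrative sketch (toy corpus, assumed library choice):
# learning word meaning from surrounding context with gensim's Word2Vec.
from gensim.models import Word2Vec

# tiny made-up corpus; real applications train on millions of tokenised sentences
sentences = [
    ["the", "prime", "minister", "addressed", "the", "parliament"],
    ["the", "pm", "addressed", "the", "parliament"],
    ["the", "king", "ruled", "the", "country"],
    ["the", "queen", "ruled", "the", "country"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=200)

# words are compared by the similarity of their learned vectors
print(model.wv.most_similar("king", topn=3))
```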
You can use the meaning of the words you've obtained through semantic analysis for a variety
of purposes. Machine translation, chatbots, and a variety of other applications necessitate a
thorough knowledge of the text, from lexical to syntactic to semantic levels. As a result, lexical,
syntactic, and semantic processing are merely the "pre-processing" layer of the total process in most of
these applications. In certain basic applications, lexical processing alone may suffice as the
pre-processing step.
Word Frequency
Characters, words, phrases, and paragraphs now make up a text. Looking at the word
frequency distribution, or visualising the word frequencies of a particular text corpus, is the
most fundamental statistical study you can undertake.
When you plot word frequencies in a big corpus of text, such as a corpus of news articles, user
reviews, Wikipedia articles, and so on, you'll see a consistent trend. Professor Srinath will
present some fascinating insights from word frequency distributions in the next lecture. You'll
also discover what stopwords are and why they're less useful than other words.
To summarise, Zipf's law (formulated by the linguist-statistician George Zipf) states that a
word's frequency is inversely proportional to its rank, with rank 1 denoting the most frequent term,
rank 2 the second most frequent, and so on. This is an example of a power-law
distribution.
The basic concept of stopwords comes from Zipf's law: they are the words with the
highest frequencies (i.e., the lowest ranks) in the text, and they are often of minimal 'importance.'
Broadly, words fall into three groups:
1. Stop words, such as 'is,' 'an,' and 'the,' which are extremely common.
2. Significant words, which are usually more crucial to comprehending the text.
3. Rarely occurring words, which appear so infrequently that they are also less relevant than the significant words.
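As an illustration of the rank-frequency pattern and of stopword filtering (the tiny corpus below is made up), here is a minimal sketch using NLTK's English stopword list:

```python
# A small sketch: word frequencies, rank order, and stopword removal.
import nltk
from collections import Counter
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)

text = "the prize is a lottery prize and the lottery is a reward for the lucky winner"
tokens = text.lower().split()

# rank words by frequency; the top ranks are dominated by words like 'the', 'is', 'a'
freq = Counter(tokens)
for rank, (word, count) in enumerate(freq.most_common(5), start=1):
    print(rank, word, count)

# drop stopwords to keep the words that actually carry the spam signal
stop_words = set(stopwords.words("english"))
significant = [t for t in tokens if t not in stop_words]
print(significant)   # ['prize', 'lottery', 'prize', 'lottery', 'reward', 'lucky', 'winner']
```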
Tokenisation
Another consideration is how to extract features from the messages so that they may be
utilised to construct a classifier. When building a machine learning model, such as a spam
detector, you must feed in features from each message that the machine learning algorithm
may use to develop the model. However, there are only two columns in the spam dataset:
one contains the message and the other provides the label associated with the message.
Machine learning, as you may know, works with numeric data rather than words. Previously,
you either regarded text columns as categorical variables and turned each categorical variable
to a numeric variable by assigning numeric values to each category, or you created dummy
variables.
Because the message column is free text rather than a categorical variable, you can't do either of these.
Your model will fail badly if you treat it as a categorical variable (you can try this out as an exercise).
You will extract features from the messages to solve this problem. You'll extract each word
from each message by splitting it down into individual words, or 'tokens.'
Tokenisation is a text splitting technique that divides the text into smaller components.
Depending on the application, these elements can be characters, phrases, sentences, or even
chapters.
In the case of the spam detector, you'll break each message down into individual words, which
is referred to as word tokenisation. Other tokenisation techniques exist as well, such
as character tokenisation, sentence tokenisation, and so on. Different contexts require different
methods of tokenisation.
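For instance, here is a minimal sketch of word and sentence tokenisation with NLTK (the message is made up):

```python
# A minimal sketch of word and sentence tokenisation with NLTK.
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt", quiet=True)   # newer NLTK versions may also need 'punkt_tab'

message = "Congratulations! You have won a lottery reward. Claim your luck now."

print(sent_tokenize(message))   # sentence tokens
print(word_tokenize(message))   # word tokens, e.g. ['Congratulations', '!', 'You', 'have', 'won', ...]
```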
Bag-of-Words Model
The bag-of-words model seeks to deduce a document's meaning solely from its content and assumes that
documents with similar content are similar to one another. NLP algorithms cannot be fed text directly;
they work with numbers. The model therefore transforms the text into a collection of words and keeps
track of how many times each word appears, converting each text into a fixed-length vector of word counts.
The data should first be pre-processed: the text is transformed to lower case and all non-word
characters and punctuation are removed.
The next step is to find the most frequently occurring words in the text. The vocabulary is
established, each sentence is tokenised into words, and the number of times each word appears is
counted. Following that, the model is built: for each document, a vector over the vocabulary is
created, with an entry set to 1 (or to the word's count) if the corresponding word appears in the
document and to 0 if it does not. And now you have your result.
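A minimal sketch, with made-up messages, of these steps (lower-casing, stripping punctuation, building the vocabulary, and producing fixed-length count vectors) using only the Python standard library:

```python
# A minimal sketch of the bag-of-words steps described above.
import re
from collections import Counter

messages = [
    "Win a FREE lottery reward now!!!",
    "Are we still meeting for lunch today?",
]

def preprocess(text):
    # lower-case and drop everything that is not a word character or whitespace
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return text.split()

tokenised = [preprocess(m) for m in messages]

# build the vocabulary from all messages
vocabulary = sorted({word for tokens in tokenised for word in tokens})

# represent each message as a fixed-length count vector over the vocabulary
vectors = []
for tokens in tokenised:
    counts = Counter(tokens)
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
print(vectors)   # one row per message, one column per vocabulary word
```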
The bag-of-words model's most significant advantage is its simplicity and ease of use. It can be
used to build a rough first model before moving on to more sophisticated word embeddings.
Although the bag-of-words paradigm is simple to grasp and apply, it does have significant limitations
and downsides. The vocabulary/dictionary needs to be designed very carefully. Its size has an impact
on the sparsity of the document representations and must be managed well.
The model ignores context by discarding the meaning of the words and focusing on frequency of
occurrence. This can be a major problem, because the arrangement of the words in a sentence can
completely change the meaning of the sentence and the model cannot account for this.
Another major drawback of this model is that it is rather difficult to model sparse representations.
This is due to informational reasons as well as computational reasons. The model finds it difficult to
harness a small amount of information in a vast representational space.
Naïve Bayes Algorithm
It's a classification method based on Bayes' Theorem and the assumption of predictor independence.
A Naive Bayes classifier, in simple terms, posits that the existence of one feature in a class is unrelated
to the presence of any other feature.
For example, if a fruit is red, round, and roughly 3 inches in diameter, it is termed an apple. Even if
these characteristics are reliant on one another or on the presence of other characteristics, they all
add to the likelihood that this fruit is an apple, which is why it is called 'Naive.' The Naive Bayes model
is simple to construct and is especially useful for very large data sets. Owing to its simplicity, Naive
Bayes is known to outperform even highly sophisticated classification methods. Bayes' theorem allows you to
calculate the posterior probability P(c|x) from P(c), P(x), and P(x|c):
P(c|x) = P(x|c) * P(c) / P(x)
Pros:
Predicting the class of a test data set is simple and quick. Naive Bayes is also good at
multi-class prediction.
When the assumption of independence holds, a Naive Bayes classifier outperforms other
models such as logistic regression, and it requires less training data.
It performs well with categorical input variables compared to numerical ones. For numerical
variables, a normal distribution (a bell curve, which is a strong assumption) is assumed.
Cons:
If a categorical variable in the test data set has a category that was not present in the
training data set, the model will assign it a probability of 0 (zero) and will be unable to
generate a prediction. This is commonly referred to as "zero frequency." We can use a
smoothing technique to remedy this; Laplace estimation is one of the most basic smoothing
techniques.
Naive Bayes is also known to be a poor probability estimator, so the probability outputs from
predict_proba should be treated with caution.
The assumption of independent predictors is another flaw in Naive Bayes. In real life,
obtaining a set of predictors that are completely independent is nearly impossible.
Real-time prediction: Naive Bayes is a fast, eager-learning classifier, so it can be used to
make real-time predictions.
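The notes build the final classifier with NLTK (see the Conclusion); as a rough illustration only, here is a minimal sketch with NLTK's NaiveBayesClassifier on a tiny made-up training set.

```python
# A minimal, illustrative sketch of a Naive Bayes spam classifier with NLTK.
# The tiny training set below is made up; a real spam dataset has thousands
# of labelled messages.
from nltk.classify import NaiveBayesClassifier

def features(message):
    # bag-of-words style features: which words are present in the message
    return {word: True for word in message.lower().split()}

train = [
    (features("win a free lottery reward now"), "spam"),
    (features("claim your luck and win a prize"), "spam"),
    (features("are we still meeting for lunch today"), "ham"),
    (features("please review the attached project report"), "ham"),
]

# NLTK smooths its probability estimates by default, which helps with the
# 'zero frequency' problem discussed above.
classifier = NaiveBayesClassifier.train(train)

print(classifier.classify(features("free lottery prize")))   # expected: 'spam'
classifier.show_most_informative_features(5)                 # which words drive the decision
```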
TF-IDF Representation
The TF-IDF (term frequency-inverse document frequency) statistic examines the relevance of a word
to a document in a collection of documents.
This is accomplished by multiplying two metrics: the number of times a word appears in a document
and the word's inverse document frequency over a collection of documents.
It has a variety of applications, including automatic text analysis and scoring words in machine
learning techniques for Natural Language Processing (NLP).
The TF-IDF measure was created for document search and retrieval. It works by growing in proportion
to the number of times a word appears in a document, but offset by the number of documents
containing the word. As a result, words like this, what, and if, which appear frequently in all
documents, rank low since they don't mean much to that document in particular.
However, if the word Bug appears frequently in one document but not in others, it is likely to be very
relevant. If we're trying to figure out which themes some NPS replies belong to, the term Bug, for
example, will almost certainly be associated with the topic Reliability, because most responses
including that word will be about that topic.
TF-IDF for a word in a document is calculated by multiplying two different metrics:
1. Term frequency: the number of times the word appears in the document. The simplest way to
calculate this is to simply count how often the word occurs in the document. The count can then
be adjusted for the length of the document or by the raw frequency of the most frequent word in
the document.
2. Inverse document frequency: how common or rare the word is across the entire document set.
It is calculated by taking the total number of documents, dividing it by the number of documents
that contain the word, and taking the logarithm of that ratio. If a word is very common and
appears in many documents, this value will be close to zero; the rarer the word, the larger the
value.
To put it another way, the TF-IDF score for the word t in document d from document
set D is, in its common form, determined as follows:
tf-idf(t, d, D) = tf(t, d) * idf(t, D), where idf(t, D) = log(N / df(t)), N is the total number of
documents in D, and df(t) is the number of documents in D that contain t.
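A rough sketch of this calculation (using the plain tf * log(N/df) form above; libraries such as scikit-learn add smoothing, so exact values will differ):

```python
# A minimal sketch of the TF-IDF calculation described above.
import math
from collections import Counter

documents = [
    "the app crashed with a bug again",
    "great app works fine",
    "another bug report for the app",
]

tokenised = [doc.lower().split() for doc in documents]
N = len(tokenised)

# document frequency: in how many documents each term appears
df = Counter()
for tokens in tokenised:
    df.update(set(tokens))

def tf_idf(term, tokens):
    tf = tokens.count(term) / len(tokens)   # term frequency in this document
    idf = math.log(N / df[term])            # inverse document frequency
    return tf * idf

print(tf_idf("bug", tokenised[0]))   # distinctive word -> non-zero score
print(tf_idf("app", tokenised[0]))   # appears in every document -> idf = log(1) = 0
```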
Applications of TF-IDF
Information Retrieval
TF-IDF was created for document search and can be utilised to offer the most relevant results for
what you're looking for. Assume you have a search engine, and someone is looking for LeBron
James. The results will be presented in the order of their importance. That is, because TF-IDF gives
the term LeBron a higher score, the most relevant sports articles will be ranked higher.
TF-IDF scores are almost certainly used in the algorithms of any search engine you've ever used.
Keyword Extraction
The TF-IDF can also be used to extract keywords from text. How? The words in a document with the
highest scores are the most relevant to that document, and hence can be deemed keywords for that
document. It's all really simple.
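A small sketch of this idea with scikit-learn's TfidfVectorizer (the feedback responses below are made up); the highest-scoring terms in each document can be taken as its keywords:

```python
# A minimal sketch of keyword extraction with TF-IDF scores.
from sklearn.feature_extraction.text import TfidfVectorizer

responses = [
    "the app keeps crashing, there is a bug in the latest release",
    "love the new design, the interface looks great",
    "another bug after the update, very unreliable",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(responses)
terms = vectorizer.get_feature_names_out()

# the three highest-scoring words in the first response are its keywords
scores = tfidf[0].toarray().ravel()
top = scores.argsort()[::-1][:3]
print([terms[i] for i in top])
```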
Conclusion:-
Finally, we went through the process of building the spam detector utilising all of the
preprocessing procedures that you had previously learned. To create the spam classifier, we
used the NLTK library rather than the scikit-learn library.
This could help companies engage with their customers in the best possible way and would
help ensure that content creators do not produce content that ends up in the spam folder.