Ie ML Project (Getting Started)

1) The document provides guidance on building a machine learning model to classify spam emails. 2) It outlines 13 steps, noting that steps 7-13 involve more complex concepts that should be learned thoroughly. 3) Key steps include preprocessing the text data by removing punctuation and null values, converting the target variable to binary values, and using count vectorization and Naive Bayes classification to build the model.

Uploaded by nicool
Copyright © All Rights Reserved

IE ML PROJECT (GETTING STARTED)

STEPS 7-13 HAVE A PRETTY STEEP LEARNING CURVE, SO I’D REQUEST YOU TO
TAKE YOUR TIME, LEARN THESE CONCEPTS THOROUGHLY AND THEN PROCEED. YOU’LL
GET A LOT OF ERRORS WHILE WORKING ON IT BUT IT’S ALL PART OF THE PROCESS!
WE’VE ALL BEEN THERE!

A few things to notice:

1) Make sure you pass encoding='ISO-8859-1' when creating the data frame,
otherwise you might get a UTF-8 decoding error (UnicodeDecodeError).
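
As a rough sketch of note 1: this assumes the data frame is created with pandas read_csv (the handout does not name the call) and uses a made-up two-row sample in place of the real file. The column names v1 and v2 are placeholders.

```python
import io

import pandas as pd

# Hypothetical sample: the pound sign is byte 0xA3 in ISO-8859-1,
# which is not a valid UTF-8 sequence on its own.
raw = "v1,v2\nham,See you at 5\nspam,Win £1000 now\n".encode("ISO-8859-1")

# Reading with UTF-8 (the pandas default) fails on that byte.
try:
    pd.read_csv(io.BytesIO(raw), encoding="utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Passing the encoding explicitly reads the data cleanly.
df = pd.read_csv(io.BytesIO(raw), encoding="ISO-8859-1")
print(utf8_failed, df.shape)
```

With a real file you would pass its path instead of the BytesIO object.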

2) On printing the dataset you can see something like this:

3) There are 3 unnamed columns filled with NaN, which signifies NULL (missing) values.

We don't need them, so go ahead and drop them.
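
Dropping those columns could look like the sketch below. The column names (v1, v2, "Unnamed: 2" and so on) are assumptions for illustration, since the printed dataset is not reproduced here. Adjust them to whatever your data frame actually shows.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the loaded dataset.
df = pd.DataFrame({
    "v1": ["ham", "spam"],
    "v2": ["See you at 5", "Win a prize now"],
    "Unnamed: 2": [np.nan, np.nan],
    "Unnamed: 3": [np.nan, np.nan],
    "Unnamed: 4": [np.nan, np.nan],
})

# Collect the unnamed columns by name and drop them in one call.
unnamed = [col for col in df.columns if col.startswith("Unnamed")]
df = df.drop(columns=unnamed)
print(df.columns.tolist())
```

Dropping by name is used here rather than dropna, because a column that is mostly (but not entirely) NaN would survive dropna(axis=1, how="all").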

4) Now that we’ve gotten rid of the NULL columns, let’s rename the two useful columns
to something more meaningful.
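
One possible way to rename them, assuming the original columns are called v1 and v2 and picking label and text as the new names (both pairs are illustrative choices, not taken from the handout):

```python
import pandas as pd

# Hypothetical two-column frame standing in for the cleaned dataset.
df = pd.DataFrame({
    "v1": ["ham", "spam"],
    "v2": ["See you at 5", "Win a prize now"],
})

# rename returns a new frame, so assign the result back.
df = df.rename(columns={"v1": "label", "v2": "text"})
print(df.columns.tolist())
```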

5) On running df.info() you can see we have 5572 entries with 5572 non-null objects
in both columns, so there aren’t any NULL values we need to take care of. We are good
to go.

6) Since this is a classification problem, we need to change the target variable to 0s and
1s instead of ham and spam. Since we are detecting spam, we’ll make spam = 1.
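
A small sketch of that conversion, assuming the target column is named label (a hypothetical name) and holds only the strings ham and spam:

```python
import pandas as pd

# Hypothetical frame; the column names are assumptions.
df = pd.DataFrame({
    "label": ["ham", "spam", "ham"],
    "text": ["See you at 5", "Win a prize now", "Call me later"],
})

# Map each class name to its number; spam is the positive class (1).
df["label"] = df["label"].map({"ham": 0, "spam": 1})
print(df["label"].tolist())
```

Any value that is not in the mapping becomes NaN, so it is worth checking df["label"].isna().sum() afterwards.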

7) Now you can see that the dataset contains a lot of punctuation, which we need to
remove.

e.g.: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

8) Go ahead and write a function which removes punctuation from the entire dataset.

(Hint: use string.punctuation)
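
If you get stuck on step 8, here is one way it could be done with the standard library only. The function name remove_punctuation and the sample messages are made up for this sketch.

```python
import string

# Translation table that deletes every character in string.punctuation.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def remove_punctuation(text: str) -> str:
    """Return text with all ASCII punctuation characters removed."""
    return text.translate(_PUNCT_TABLE)


messages = ["Free entry!!! Text WIN to 80086, now.", "Ok... see you @ 5?"]
cleaned = [remove_punctuation(m) for m in messages]
print(cleaned)
```

To apply it to the entire dataset, something like df["text"].apply(remove_punctuation) should work, assuming the message column is called text.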

9) Split the data using train_test_split.
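
A minimal sketch of the split, assuming scikit-learn's train_test_split is what is meant. The 80/20 ratio, the random_state value and the stratify argument are choices made for this example, not something the handout prescribes.

```python
from sklearn.model_selection import train_test_split

# Made-up messages and labels (1 = spam, 0 = ham).
texts = ["msg %d" % i for i in range(10)]
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]

# Hold out 20% for testing; stratify keeps the ham/spam ratio similar
# in both parts, and random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)
print(len(X_train), len(X_test))
```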

10) Now comes the important part: you take the words from the ham messages and put
them in bag1, and take the words from the spam messages and put them in bag2. The next
time you see a word, check the probability that it belongs to the ham bag.

1 - (probability that the word belongs to the ham bag) = probability that the word
belongs to the spam bag.

Because a computer doesn’t understand text input, you need to convert it into a
matrix of numbers (binary word indicators or word counts).

For more information refer to this:

http://www.inf.ed.ac.uk/teaching/courses/inf2b/learnnotes/inf2b-learn-note07-2up.pdf
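
To make the two-bag idea concrete, here is a hand-rolled sketch in plain Python. The training messages are invented, and the add-one (Laplace) smoothing and the equal 50/50 prior are assumptions made for this example, not details from the handout.

```python
from collections import Counter

# Tiny made-up training set (1 = spam, 0 = ham).
train = [
    ("win cash prize now", 1),
    ("claim your free prize", 1),
    ("are you free for lunch", 0),
    ("see you at lunch", 0),
]

# bag1 holds ham word counts, bag2 holds spam word counts.
bag1, bag2 = Counter(), Counter()
for text, label in train:
    (bag2 if label == 1 else bag1).update(text.split())

vocab = sorted(set(bag1) | set(bag2))


def word_likelihood(word, bag):
    """P(word | class) with add-one smoothing over the shared vocabulary."""
    return (bag[word] + 1) / (sum(bag.values()) + len(vocab))


def p_spam_given_word(word, prior_spam=0.5):
    """P(spam | word) from Bayes' rule; P(ham | word) is 1 minus this."""
    spam = word_likelihood(word, bag2) * prior_spam
    ham = word_likelihood(word, bag1) * (1 - prior_spam)
    return spam / (spam + ham)


def to_count_vector(text):
    """Represent a message as word counts over the shared vocabulary."""
    counts = Counter(text.split())
    return [counts[w] for w in vocab]


print(p_spam_given_word("prize"), p_spam_given_word("lunch"))
print(to_count_vector("free free lunch"))
```

Here "prize" leans towards spam and "lunch" towards ham, and each message turns into one row of numbers. Stacking those rows gives the matrix mentioned above.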

11) The whole process of putting words into bags and converting them into a matrix can be
done using CountVectorizer or TfidfVectorizer. Look them up, it’s a nice concept!

12) Use CountVectorizer/TfidfVectorizer to transform the text, then fit a Multinomial Naive Bayes model on the result.
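
Steps 11 and 12 could be wired together roughly as follows, assuming scikit-learn's CountVectorizer and MultinomialNB with their default settings. The four training messages are invented and far too few for a real model.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training data (1 = spam, 0 = ham).
train_texts = [
    "win cash prize now",
    "claim your free prize",
    "are you free for lunch",
    "see you at lunch",
]
train_labels = [1, 1, 0, 0]

# Learn the vocabulary on the training text only, then count words.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# Fit Multinomial Naive Bayes on the count matrix.
model = MultinomialNB()
model.fit(X_train, train_labels)

# New messages must go through transform (not fit_transform) so they
# use the same vocabulary; unseen words are simply ignored.
X_new = vectorizer.transform(["free cash prize", "lunch at noon"])
predictions = model.predict(X_new)
print(predictions)
```

Swapping CountVectorizer for TfidfVectorizer (same module) needs no other changes to this sketch.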

13) Use your preferred metric (for example, accuracy) to evaluate the model.
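
As an illustration of step 13, the sketch below scores some hard-coded predictions with a few functions from sklearn.metrics. The labels are invented, and which metric to prefer is left to you.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Made-up true labels and model predictions (1 = spam, 0 = ham).
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 0, 1, 1, 0, 1, 0]

accuracy = accuracy_score(y_true, y_pred)    # share of correct predictions
precision = precision_score(y_true, y_pred)  # predicted spam that is really spam
recall = recall_score(y_true, y_pred)        # real spam that was caught
cm = confusion_matrix(y_true, y_pred)        # rows: true class, columns: predicted

print(accuracy, precision, recall)
print(cm)
```

Since spam datasets usually contain far more ham than spam, accuracy alone can look good even when much of the spam is missed, so it may be worth checking precision and recall too.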

GOODBYE
