Ie ML Project (Getting Started)
1) Make sure you pass encoding='ISO-8859-1' when reading the file into the data frame,
otherwise you might get a UTF-8 decoding error (UnicodeDecodeError).
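To see why the encoding matters, here is a minimal sketch. It assumes pandas is installed and uses a tiny in-memory file as a stand-in for the real CSV; the column names and rows are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real CSV file, encoded as ISO-8859-1.
# The pound sign becomes the single byte 0xA3, which is not valid UTF-8 here.
raw_bytes = "label,text\nham,See you at 5\nspam,Win £1000 now!\n".encode("ISO-8859-1")

# pandas assumes UTF-8 by default, so reading without the argument should fail.
try:
    pd.read_csv(io.BytesIO(raw_bytes))
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Passing the right encoding decodes the file correctly.
df = pd.read_csv(io.BytesIO(raw_bytes), encoding="ISO-8859-1")
print(utf8_failed)
print(df)
```

With a real file you would pass its path instead of the io.BytesIO object.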
3) There are 3 unnamed columns filled with NaN, which signifies NULL values. They carry
no useful information, so drop them.
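One way to drop those columns is sketched below. The column names ("v1", "v2", "Unnamed: 2" and so on) are assumptions about how the file loads; check df.columns on your own data frame first.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the layout described above (column names are assumptions).
df = pd.DataFrame({
    "v1": ["ham", "spam"],
    "v2": ["See you at 5", "Win a prize now"],
    "Unnamed: 2": [np.nan, np.nan],
    "Unnamed: 3": [np.nan, np.nan],
    "Unnamed: 4": [np.nan, np.nan],
})

# Collect the columns whose name starts with "Unnamed" and drop them.
unnamed_columns = [name for name in df.columns if name.startswith("Unnamed")]
df = df.drop(columns=unnamed_columns)
print(df.columns.tolist())
```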
4) Now that we’ve gotten rid of the NULL columns, let’s rename the two useful columns
to something more meaningful.
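A rename could look like the sketch below. Both the old names ("v1", "v2") and the new names ("label", "message") are assumptions chosen for illustration; use whatever reads best to you.

```python
import pandas as pd

# Toy frame with the two remaining columns (names are assumptions).
df = pd.DataFrame({
    "v1": ["ham", "spam"],
    "v2": ["See you at 5", "Win a prize now"],
})

# Map each old column name to a more descriptive one.
df = df.rename(columns={"v1": "label", "v2": "message"})
print(df.columns.tolist())
```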
5) Running df.info() shows 5572 entries with 5572 non-null objects in both columns,
so there aren’t any NULL values we need to take care of. We are good
to go.
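If you want a second check besides df.info(), counting the NULL values per column is a quick one. The frame below is a toy example, not the real dataset.

```python
import pandas as pd

# Toy frame standing in for the cleaned dataset (hypothetical rows).
df = pd.DataFrame({
    "label": ["ham", "spam", "ham"],
    "message": ["See you at 5", "Win a prize now", "Call me later"],
})

# Prints the entry count and the non-null count per column.
df.info()

# Number of NULL values in each column; all zeros means nothing to clean up.
null_counts = df.isnull().sum()
print(null_counts)
```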
6) Since this is a classification problem, we need to change the target variable to 0s and
1s instead of ham and spam. Since we are detecting spam, we’ll make spam = 1.
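The label conversion can be sketched with a simple mapping. The column name "label" is an assumption here.

```python
import pandas as pd

# Toy frame; the column name "label" is an assumption.
df = pd.DataFrame({"label": ["ham", "spam", "ham", "spam"]})

# Replace the text labels with numbers: ham -> 0, spam -> 1.
df["label"] = df["label"].map({"ham": 0, "spam": 1})
print(df["label"].tolist())
```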
7) Now you can see that the dataset contains a lot of punctuation, which we need to
remove.
e.g.: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
8) Go ahead and write a function which removes punctuation from the entire dataset.
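Here is one possible version of such a function, using only the standard library. The characters listed above are exactly what Python ships as string.punctuation; the function name remove_punctuation is made up for this sketch.

```python
import string

# Translation table that deletes every character in string.punctuation.
PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)

def remove_punctuation(text):
    """Return text with all ASCII punctuation characters removed."""
    return text.translate(PUNCTUATION_TABLE)

messages = ["Hello, world!!! It's free...", "Call now: 0800-123"]
cleaned = [remove_punctuation(message) for message in messages]
print(cleaned)
```

On a data frame you could apply the same function to the whole text column, for example with df["message"].apply(remove_punctuation), assuming the column is called "message".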
10) Now comes the important part: put the words from ham messages in bag1 and the words
from spam messages in bag2. The next time you see a word, check the probability that it
belongs to the ham bag. Since there are only two bags:
probability of the word belonging to the spam bag = 1 - probability of the word
belonging to the ham bag.
Because a computer doesn’t understand text input, you need to convert the messages into a
matrix of binary numbers (1 if a word appears in a message, 0 if it doesn’t).
http://www.inf.ed.ac.uk/teaching/courses/inf2b/learnnotes/inf2b-learn-note07-2up.pdf
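To make the two-bag idea concrete, here is a rough sketch in plain Python. It is a simple count-based estimate on a made-up corpus, not a full Naive Bayes classifier, and the function names (ham_share, spam_share, to_binary_vector) are invented for this example.

```python
from collections import Counter

# Tiny hand-made corpus (hypothetical), already lowercased and punctuation-free.
ham_messages = ["see you at lunch", "are you free at five"]
spam_messages = ["free prize claim now", "claim your free prize"]

# bag1 and bag2: how often each word occurs in ham and in spam messages.
ham_bag = Counter(word for message in ham_messages for word in message.split())
spam_bag = Counter(word for message in spam_messages for word in message.split())

def ham_share(word):
    """Fraction of a word's occurrences that fall in the ham bag."""
    total = ham_bag[word] + spam_bag[word]
    if total == 0:
        return 0.5  # unseen word: no evidence either way (an arbitrary choice)
    return ham_bag[word] / total

def spam_share(word):
    """With only two bags, the spam share is 1 minus the ham share."""
    return 1 - ham_share(word)

# Binary matrix: one row per message, one column per vocabulary word.
vocabulary = sorted(set(ham_bag) | set(spam_bag))

def to_binary_vector(message):
    """1 where the vocabulary word appears in the message, 0 otherwise."""
    words = set(message.split())
    return [1 if term in words else 0 for term in vocabulary]

binary_matrix = [to_binary_vector(m) for m in ham_messages + spam_messages]

print(ham_share("free"), spam_share("free"))
print(vocabulary)
print(binary_matrix[0])
```

Here "free" shows up once in ham and twice in spam, so it leans towards spam, while "you" only ever appears in ham.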
11) The whole process of putting words into bags and converting them into a matrix can be
done using CountVectorizer and TfidfVectorizer. Look them up, it’s a nice concept!
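As a starting point, here is a minimal sketch of both vectorizers. It assumes scikit-learn is installed and uses three made-up messages; the default settings are kept, so words are lowercased and single-character tokens are ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up messages, already cleaned of punctuation.
messages = [
    "free prize claim now",
    "see you at lunch",
    "claim your free prize",
]

# CountVectorizer: each row is a message, each column a word, each cell a count.
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(messages)

# TfidfVectorizer: same layout, but words that appear in many messages get less weight.
tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform(messages)

print(sorted(count_vectorizer.vocabulary_))
print(counts.toarray())
print(tfidf.shape)
```

Both calls return sparse matrices; toarray() is only used here so the small example prints nicely.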
GOODBYE