Quiz 2

The document provides instructions for a quiz on text classification. It explains how to download data, prepare the data by cleaning and splitting it, build a naive Bayes classifier model to predict spam or ham, and test the model on held-out data. Key steps include splitting the dataset into training and test sets, finding word probabilities conditioned on spam or ham, and making predictions by calculating spam and ham scores for test messages. The goal is to classify short text messages as spam or non-spam (ham) using a naive Bayes approach.

### Quiz 2 Instructions

- Fill in all the incomplete functions. Strictly follow the function specs.
- Do not copy or plagiarize. IIPE, VIZAG has a very strict policy against plagiarism.

Download and read file

### Download data from Google Drive. You need not mess with this code.

import requests

def download_file_from_google_drive(id, destination):
    URL = "https://docs.google.com/uc?export=download"

    session = requests.Session()
    response = session.get(URL, params={'id': id}, stream=True)

    token = get_confirm_token(response)
    if token:
        params = {'id': id, 'confirm': token}
        response = session.get(URL, params=params, stream=True)

    save_response_content(response, destination)

def get_confirm_token(response):
    for key, value in response.cookies.items():
        if key.startswith('download_warning'):
            return value
    return None

def save_response_content(response, destination):
    CHUNK_SIZE = 32768
    with open(destination, "wb") as f:
        for chunk in response.iter_content(CHUNK_SIZE):
            if chunk:  # filter out keep-alive new chunks
                f.write(chunk)

if __name__ == "__main__":
    file_id = '1e_Azf9zGvSWsDhM9PP2sfMNKC72-iWAK'
    destination = 'data.txt'
    download_file_from_google_drive(file_id, destination)
    with open('data.txt', 'r') as f:
        data_raw = f.readlines()

1. Data preparation

Now the entire data is stored in the list data_raw.


Every line in the file is a different element of the list.
First let us look at the first five elements of the list.

1.1

Write a function that returns the first five elements of the list if the length of the list is greater than or equal to 5, and None otherwise.

def first_five_in_list(l):
    """
    Inputs:
        l: Python list

    Outputs:
        l_5: Python list, first five elements of the list if the length of
        the list is greater than or equal to 5; None otherwise
    """
    ### Your code here
    return l_5
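One possible implementation of this spec (a sketch; a slice handles the truncation):

```python
def first_five_in_list(l):
    """Return the first five elements of l, or None if len(l) < 5."""
    if len(l) >= 5:
        return l[:5]
    return None
```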

1.2

def remove_trailing_newlines(s):
    """
    Function that removes all trailing newline characters at the end of s
    Inputs:
        s: string

    Outputs:
        s_clean: string, s without newline characters at the end
    """
    ### Write your code here
    return s_clean
If we apply remove_trailing_newlines to the first element of data_raw, we get the same string with the newline at the end removed.
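One way to implement remove_trailing_newlines is with str.rstrip restricted to newline characters:

```python
def remove_trailing_newlines(s):
    # rstrip('\n') strips newlines only from the end, leaving inner newlines intact
    return s.rstrip('\n')
```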

1.3

But now we need to apply this function to the whole list.


Write a function named mapl, that takes two arguments - a function on elements of type t and a
list l of elements of type t and applies the function over all elements of the list l and returns them
as a list.

def mapl(f, l):
    """
    Function that applies f over all elements of l
    Inputs:
        f: function, f takes elements of type t1 and returns elements of type t2
        l: list, list of elements of type t1

    Outputs:
        f_l: list, list of elements of type t2 obtained by applying f over
        each element of l
    """
    ### Write your code here
    return f_l
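A minimal mapl sketch using a list comprehension (equivalent to `list(map(f, l))`):

```python
def mapl(f, l):
    # apply f to every element of l, preserving order
    return [f(x) for x in l]
```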

Now we can use mapl to apply remove_trailing_newlines to all lines in data_raw

data_clean = mapl(remove_trailing_newlines, data_raw)


First five elements of data_clean look like this:

This is a dataset of text messages which we have to classify as spam or ham. Ham means legitimate, non-spam text messages. More details can be found here -
http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection

You can see that each line starts by specifying whether the message is ham or spam, followed by a tab character, \t, and then the actual text message.
Now we need to split the lines to extract the two components - the data label (ham or spam) and the data sample (the text message).

1.4

Write a function split_at_s that takes two strings - text and s.

It splits the string text into two parts at the first occurrence of s.
Then it wraps both parts in a tuple and returns it.

def split_at_s(text, s):
    """Function that splits string text into two parts at the first
    occurrence of string s
    Inputs:
        text: string, string to be split
        s: string, string of length 1 at which to split

    Outputs:
        split_text: tuple of size 2, contains text split in two (do not
        include the string s at which the split occurs in either part)
    """
    ### Write your code here
    return split_text
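One possible sketch, relying on str.split with maxsplit=1 so only the first occurrence of s is used:

```python
def split_at_s(text, s):
    # maxsplit=1 splits at the first occurrence only; s itself is dropped
    before, after = text.split(s, 1)
    return (before, after)
```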
Python has a very handy feature used to define short functions, called lambda expressions; see the official Python docs for details.

Use a lambda expression and split_at_s to write a function, split_at_tab, that takes only one argument - text - and splits it at the first occurrence of the '\t' character. (If you can't understand lambda expressions, just define the function in the usual way.)

### Write your code here


1.5

Now apply the split_at_tab function over the elements of the list data_clean and assign the result to a variable named data_clean2.

#### Write your code here
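Assuming mapl and split_at_tab from the earlier steps, this step is a one-liner; a self-contained sketch on toy data:

```python
def mapl(f, l):
    return [f(x) for x in l]

split_at_tab = lambda text: tuple(text.split('\t', 1))

data_clean = ["ham\thello there", "spam\twin money now"]
data_clean2 = mapl(split_at_tab, data_clean)
```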

After splitting at the '\t' character, one data point looks like this:

Now let us remove the punctuation in each sms.

import string

def remove_punctuations_and_lower(text):
    """Function that removes punctuation from a text and lowercases it
    Inputs:
        text: string
    Outputs:
        text_wo_punctuations: string
    """
    return text.translate(str.maketrans("", "", string.punctuation)).lower()

1.6

Now use the function remove_punctuations_and_lower to remove punctuation from the text part of all of the tuples in data_clean2 and assign the result to a variable named dataset

### Write your code here
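A sketch on toy data, applying the cleaner only to the sms part of each tuple:

```python
import string

def remove_punctuations_and_lower(text):
    return text.translate(str.maketrans("", "", string.punctuation)).lower()

data_clean2 = [("ham", "Hello, there!"), ("spam", "WIN cash NOW!!")]
# clean only the sms part; the label is left unchanged
dataset = [(label, remove_punctuations_and_lower(sms)) for label, sms in data_clean2]
```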

First 5 elements of dataset look like this now.

Now let us count the number of occurrences of ham and spam in our dataset.
1.7

Write a function counter that takes two arguments -

- a list l of elements of type t
- a function f: t → u (meaning f takes an argument of type t and returns values of type u)

counter returns a dictionary whose keys are u1, u2, … - the unique values of type u obtained by applying f over the elements of l.
The value corresponding to a key, say u1, is the number of times that value is obtained when we apply f over the elements of l.

def counter(l, f):
    """
    Function that returns a dictionary of counts of unique values obtained
    by applying f over elements of l
    Inputs:
        l: list; list of elements of type t
        f: function; f takes arguments of type t and returns values of type u

    Outputs:
        count_dict: dictionary; keys are elements of type u, values are ints
    """
    ### Write your code here
    return count_dict
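One straightforward implementation using dict.get to initialize unseen keys:

```python
def counter(l, f):
    count_dict = {}
    for x in l:
        key = f(x)
        # get(key, 0) returns 0 the first time a key is seen
        count_dict[key] = count_dict.get(key, 0) + 1
    return count_dict
```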

1.8

Write a function named aux_func that can be passed to counter along with the list dataset to
get a dictionary containing counts of ham and spam

#### Write your code here
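Since every element of dataset is a (label, sms) tuple, aux_func only needs to pick out the label; a toy sketch (assuming counter from 1.7):

```python
def counter(l, f):
    count_dict = {}
    for x in l:
        count_dict[f(x)] = count_dict.get(f(x), 0) + 1
    return count_dict

def aux_func(sample):
    # the label is the first element of the (label, sms) tuple
    return sample[0]

toy_dataset = [("ham", "hello"), ("spam", "win"), ("ham", "hi")]
counts = counter(toy_dataset, aux_func)
```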

The counts of ham and spam as we can see are {'ham': 4827, 'spam': 747}

Now let us split our dataset into training and test sets. We'll first shuffle the elements of the
dataset, then we'll use 80% of data for training and 20% for testing.
1.9

Write a function that takes a list, randomly shuffles it and then returns it.
Hint: Use the random library of Python - https://docs.python.org/3/library/random.html


def random_shuffle(l):
    """Function that returns a randomly shuffled list
    Inputs:
        l: list
    Outputs:
        l_shuffled: list, contains the same elements as l but randomly shuffled
    """
    ### Write your code here
    return l_shuffled
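A sketch using random.shuffle on a copy, so the caller's list is not mutated:

```python
import random

def random_shuffle(l):
    # shuffle a copy; random.shuffle works in place
    l_shuffled = l[:]
    random.shuffle(l_shuffled)
    return l_shuffled
```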

1.10

Now split the shuffled list. Take the first 80% (4459) of the samples and assign them to a variable called data_train. Put the rest in a variable called data_test

### Write your code here
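The 80/20 split is just list slicing; a self-contained sketch on a toy shuffled list:

```python
dataset_shuffled = [("ham", "a"), ("spam", "b"), ("ham", "c"), ("ham", "d"), ("spam", "e")]
split_idx = int(0.8 * len(dataset_shuffled))  # 80% of the samples
data_train = dataset_shuffled[:split_idx]
data_test = dataset_shuffled[split_idx:]
```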

2. Data Modeling

We shall use Naive Bayes to model our classifier. You can read about Naive Bayes here (https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Multinomial_naive_Bayes), but you don't actually need to, because we are going to move step by step in building this classifier.

First we need to find the probabilities P(wi|C).

We read P(A|B) as the probability of event A given event B.
P(wi|C) is the probability that word wi occurs in an sms given that the sms belongs to class C, where C can be either spam or ham.
But we will be finding P~(wi|C), a smoothed probability, to take care of words with 0 probability that may cause problems:

P~(wi|C) = (number of occurrences of wi in all samples of class C + 1) / (total number of words in all samples of class C + vocabulary size)

2.1

Find the vocabulary - list of unique words in all smses of data_train and assign it to the
variable vocab

### Write your code here
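A sketch on toy training data; a set comprehension over all words in all training smses gives the unique words:

```python
data_train = [("ham", "hello there hello"), ("spam", "win cash now")]
# unique words across all training smses
vocab = list({word for _, sms in data_train for word in sms.split()})
```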

2.2

For every word wi in vocab, find the count (total number of occurrences) of wi in all smses of type spam. Put these counts in a dictionary and assign it to a variable named dict_spam, where the key is the word wi and the value is the count.
In a similar way, create a variable called dict_ham which contains the counts of each vocabulary word in smses of type ham. (This is only w.r.t. samples in data_train)

### Write your code here
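A sketch on toy data; a hypothetical helper word_counts builds both dictionaries from the same loop:

```python
data_train = [("ham", "hello there hello"), ("spam", "win cash now"), ("spam", "win big")]
vocab = list({w for _, sms in data_train for w in sms.split()})

def word_counts(target_label):
    # count occurrences of every vocabulary word across smses of one class
    counts = {w: 0 for w in vocab}
    for label, sms in data_train:
        if label == target_label:
            for w in sms.split():
                counts[w] += 1
    return counts

dict_spam = word_counts("spam")
dict_ham = word_counts("ham")
```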

2.3

For every word wi in vocab, find the smoothed probability P~(wi|spam) and put it in a dictionary named dict_prob_spam. In a similar way, define the dictionary dict_prob_ham which contains the smoothed probabilities P~(wi|ham)

### Write your code here
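With dict_spam and vocab in hand, the smoothed probabilities are a direct transcription of the formula above; toy values for illustration:

```python
vocab = ["hello", "there", "win", "cash"]
dict_spam = {"hello": 0, "there": 0, "win": 2, "cash": 1}

total_spam_words = sum(dict_spam.values())  # total words in all spam smses
# add-one smoothing: numerator + 1, denominator + vocabulary size
dict_prob_spam = {w: (dict_spam[w] + 1) / (total_spam_words + len(vocab))
                  for w in vocab}
```

The smoothed probabilities still sum to 1 over the vocabulary, which is a quick sanity check.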

3. Prediction

We need to test our model on data_test. For each sample of data_test, the prediction procedure is as follows:

- For all words common to the sample and the vocabulary, find spam_score and ham_score.
- If spam_score is higher than ham_score, we predict the sample to be spam, and vice versa.
- spam_score = P(spam) * P~(w1|spam) * P~(w2|spam) * … where w1, w2, … are the words which occur both in the test sms and in the vocabulary.
- Similarly, ham_score = P(ham) * P~(w1|ham) * P~(w2|ham) * …

Here P(spam) = (number of samples of type spam in the training set) / (total number of samples in the training set), and similarly P(ham) = (number of samples of type ham in the training set) / (total number of samples in the training set).
(Note: the above is the prediction procedure for a single sample in data_test)
Write a function predict which does this.

3.1

def predict(text, dict_prob_spam, dict_prob_ham, data_train):
    """Function which predicts the label of the sms
    Inputs:
        text: string, sms
        dict_prob_spam: dictionary, contains dict_prob_spam as defined above
        dict_prob_ham: dictionary, contains dict_prob_ham as defined above
        data_train: list, list of tuples of type (label, sms), contains
        training dataset

    Outputs:
        prediction: string, one of two strings - either 'spam' or 'ham'
    """
    ### Write your code here
    return prediction
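A sketch following the procedure above: start from the class priors, then multiply in the smoothed probability of every word that appears in the vocabulary (here represented by membership in the probability dictionaries):

```python
def predict(text, dict_prob_spam, dict_prob_ham, data_train):
    n = len(data_train)
    n_spam = sum(1 for label, _ in data_train if label == "spam")
    # class priors P(spam) and P(ham)
    spam_score = n_spam / n
    ham_score = (n - n_spam) / n
    for word in text.split():
        # only words present in the vocabulary contribute to the scores
        if word in dict_prob_spam:
            spam_score *= dict_prob_spam[word]
            ham_score *= dict_prob_ham[word]
    return "spam" if spam_score > ham_score else "ham"
```

For long messages the product of many small probabilities can underflow; summing log-probabilities instead is a common refinement, though the quiz does not require it.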

3.2

Now find the accuracy of the model by applying the function predict to all the samples in data_test:

accuracy = (number of correct predictions) / (size of the test set)

Write the function accuracy which applies predict to all samples in data_test and returns the accuracy

def accuracy(data_test, dict_prob_spam, dict_prob_ham, data_train):
    """Function which finds the accuracy of the model
    Inputs:
        data_test: list, contains tuples of data (label, sms)
        dict_prob_spam: dictionary, contains dict_prob_spam as defined above
        dict_prob_ham: dictionary, contains dict_prob_ham as defined above
        data_train: list, list of tuples of type (label, sms), contains
        training dataset

    Outputs:
        accuracy: float, value of accuracy
    """
    ### Write your code here
    return accuracy
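A self-contained sketch pairing accuracy with a minimal predict (as sketched for 3.1), counting how many predictions match the true labels:

```python
def predict(text, dict_prob_spam, dict_prob_ham, data_train):
    n = len(data_train)
    n_spam = sum(1 for label, _ in data_train if label == "spam")
    spam_score, ham_score = n_spam / n, (n - n_spam) / n
    for word in text.split():
        if word in dict_prob_spam:
            spam_score *= dict_prob_spam[word]
            ham_score *= dict_prob_ham[word]
    return "spam" if spam_score > ham_score else "ham"

def accuracy(data_test, dict_prob_spam, dict_prob_ham, data_train):
    # count how many predictions match the true labels
    correct = sum(1 for label, sms in data_test
                  if predict(sms, dict_prob_spam, dict_prob_ham, data_train) == label)
    return correct / len(data_test)
```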
