Quiz 2
- Fill in all the incomplete functions. Strictly follow the function specs.
- Do not copy or plagiarize. IIPE, VIZAG has a very strict policy against plagiarism.
### Download data from Google Drive. You need not mess with this code.
import requests

def download_file_from_google_drive(file_id, destination):
    URL = "https://docs.google.com/uc?export=download"
    session = requests.Session()
    response = session.get(URL, params={'id': file_id}, stream=True)
    token = get_confirm_token(response)
    if token:
        params = {'id': file_id, 'confirm': token}
        response = session.get(URL, params=params, stream=True)
    save_response_content(response, destination)

def get_confirm_token(response):
    for key, value in response.cookies.items():
        if key.startswith('download_warning'):
            return value
    return None

def save_response_content(response, destination):
    CHUNK_SIZE = 32768
    with open(destination, "wb") as f:
        for chunk in response.iter_content(CHUNK_SIZE):
            if chunk:  # filter out keep-alive chunks
                f.write(chunk)

if __name__ == "__main__":
    file_id = '1e_Azf9zGvSWsDhM9PP2sfMNKC72-iWAK'
    destination = 'data.txt'
    download_file_from_google_drive(file_id, destination)

with open('data.txt', 'r') as f:
    data_raw = f.readlines()
1. Data preparation
1.1
Write a function that returns the first five elements of the list if the length of the list is greater than or equal to 5, and None otherwise.
def first_five_in_list(l):
    """
    Inputs:
        l: Python list
    Outputs:
        l_5: Python list, first five elements of l if the length of l is
        greater than or equal to 5; None otherwise
    """
    ### Your code here
    return l_5
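One possible solution, shown as a sketch (any implementation that meets the spec is fine):

```python
def first_five_in_list(l):
    # Return the first five elements when the list is long enough,
    # otherwise return None as the spec requires.
    if len(l) >= 5:
        return l[:5]
    return None

print(first_five_in_list([1, 2, 3, 4, 5, 6]))  # → [1, 2, 3, 4, 5]
print(first_five_in_list([1, 2]))              # → None
```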
1.2
def remove_trailing_newlines(s):
    """Function that removes all trailing newline characters from the end of a string
    Inputs:
        s: string
    Outputs:
        s_clean: string, string s but without newline characters at the end
    """
    ### Write your code here
    return s_clean
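A short sketch of one way to do this, using str.rstrip with an explicit character set so only newlines (not other whitespace) are removed:

```python
def remove_trailing_newlines(s):
    # rstrip('\n') strips newline characters only, and only from the end.
    return s.rstrip('\n')

print(remove_trailing_newlines("ham\tGo until jurong point\n"))  # → ham	Go until jurong point
```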
If we apply remove_trailing_newlines to the first element of data_raw, you can see that the newline at the end has disappeared.
1.3
    Outputs:
        f_l: list, list of elements of type t2 obtained by applying f over each element of l
    """
    ### Write your code here
    return f_l
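The function header for 1.3 was lost in extraction; from the surviving docstring it is a map over a list. A hedged sketch follows, where the name apply_f and the argument order are assumptions, not the quiz's original signature:

```python
def apply_f(f, l):
    # Apply f to every element of l; a list comprehension is the
    # idiomatic way to build the resulting list of type-t2 elements.
    f_l = [f(x) for x in l]
    return f_l

print(apply_f(len, ["spam", "ham"]))  # → [4, 3]
```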
This is a dataset of text messages which we have to classify into spam or ham. Ham means non-spam, i.e. relevant, text messages. More details can be found here -
http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
You can see that each line starts by specifying whether the message is ham or spam, followed by a tab character \t and then the actual text message.
Now we need to split the lines to extract the two components - data label (ham or spam) and data
sample (the text message).
1.4
def split_at_s(text, s):
    """Function that splits text in two at the first occurrence of the string s
    Inputs:
        text: string
        s: string, the substring at which to split
    Outputs:
        split_text: tuple of size 2, contains text split in two (do not include
        the string s at which split occurs in any of the split parts)
    """
    ### Write your code here
    return split_text
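One possible sketch, using str.partition, which splits at the first occurrence of a separator and makes it easy to drop the separator itself:

```python
def split_at_s(text, s):
    # partition returns (before, separator, after); we keep only the
    # parts before and after, dropping the separator s as required.
    before, _, after = text.partition(s)
    split_text = (before, after)
    return split_text

print(split_at_s("ham\tGo until jurong point", "\t"))  # → ('ham', 'Go until jurong point')
```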
Python has a very handy feature for defining short functions, called lambda expressions. This is from the official Python docs.
Use lambda expressions and split_at_s to write a function, split_at_tab, that takes only one argument - text - and splits it at the first occurrence of the '\t' character. (If you can't understand lambda expressions, just define the function in the usual way.)
Now apply the split_at_tab function over the elements of the list data_clean and assign the result to a variable named data_clean2.
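A sketch of both steps. It assumes the split_at_s function from 1.4; the two-line data_clean list here is a stand-in for the real cleaned data:

```python
def split_at_s(text, s):
    before, _, after = text.partition(s)
    return (before, after)

# A lambda expression fixes the separator argument to the tab character.
split_at_tab = lambda text: split_at_s(text, '\t')

# data_clean stands in for the cleaned list built in the earlier steps.
data_clean = ["ham\tOk lar... Joking wif u oni...",
              "spam\tFree entry in 2 a wkly comp"]
data_clean2 = [split_at_tab(line) for line in data_clean]
print(data_clean2[0])  # → ('ham', 'Ok lar... Joking wif u oni...')
```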
import string
def remove_punctuations_and_lower(text):
    """Function that removes punctuation from a text and lowercases it
    Inputs:
        text: string
    Outputs:
        text_wo_punctuations: string, text without punctuation, lowercased
    """
    return (text.translate(str.maketrans("", "", string.punctuation))).lower()
1.6
Now use the function remove_punctuations_and_lower to remove punctuation from the text part of all of the tuples in data_clean2 and assign the result to a variable named dataset.
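A sketch of this step; the two tuples in data_clean2 below are stand-ins for the real data:

```python
import string

def remove_punctuations_and_lower(text):
    return text.translate(str.maketrans("", "", string.punctuation)).lower()

# data_clean2 stands in for the (label, sms) tuples from step 1.4.
data_clean2 = [('ham', 'Ok lar... Joking wif u oni!'),
               ('spam', 'WINNER!! Claim now.')]

# Clean only the text part of each tuple, keeping the label untouched.
dataset = [(label, remove_punctuations_and_lower(sms)) for label, sms in data_clean2]
print(dataset[0])  # → ('ham', 'ok lar joking wif u oni')
```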
Now let us count the number of occurrences of ham and spam in our dataset.
1.7
Counter returns a dictionary whose keys are u1, u2, ... etc. - the unique values of type u obtained by applying f over the elements of l. The value corresponding to a key, say u1, is the number of times that key is obtained when we apply f over the elements of l.
def counter(l, f):
    """
    Inputs:
        l: list
        f: function that maps each element of l to a value of type u
    Outputs:
        count_dict: dictionary; keys are elements of type u, values are ints
    """
    ### Write your code here
    return count_dict
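A possible implementation; the argument order counter(l, f) is an assumption, since the original header was lost:

```python
def counter(l, f):
    # Build a histogram of f(x) over all x in l.
    count_dict = {}
    for x in l:
        u = f(x)
        count_dict[u] = count_dict.get(u, 0) + 1
    return count_dict

print(counter([1, 2, 3, 4], lambda x: x % 2))  # → {1: 2, 0: 2}
```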
1.8
Write a function named aux_func that can be passed to counter, along with the list dataset, to get a dictionary containing the counts of ham and spam.
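A minimal sketch: aux_func only needs to extract the label from each (label, sms) tuple. The counter signature and the three-element dataset below are assumptions for illustration:

```python
# aux_func extracts the label from a (label, sms) tuple, so counter
# ends up tallying how many times each label occurs.
def aux_func(sample):
    return sample[0]

def counter(l, f):  # assumed signature from 1.7
    count_dict = {}
    for x in l:
        count_dict[f(x)] = count_dict.get(f(x), 0) + 1
    return count_dict

dataset = [('ham', 'ok'), ('spam', 'win'), ('ham', 'hi')]
print(counter(dataset, aux_func))  # → {'ham': 2, 'spam': 1}
```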
The counts of ham and spam, as we can see, are {'ham': 4827, 'spam': 747}.
Now let us split our dataset into training and test sets. We'll first shuffle the elements of the
dataset, then we'll use 80% of data for training and 20% for testing.
1.9
Write a function that takes a list, randomly shuffles it and then returns it.
Hint: Use the random library of python - https://docs.python.org/3/library/random.html
def random_shuffle(l):
    """Function that returns a randomly shuffled list
    Inputs:
        l: list
    Outputs:
        l_shuffled: list, contains same elements as l but randomly shuffled
    """
    ### Write your code here
    return l_shuffled
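One way to do this with the random library, using random.sample so the original list is left untouched (random.shuffle, which shuffles in place, would also satisfy the spec):

```python
import random

def random_shuffle(l):
    # sample with k = len(l) returns a new list containing every
    # element of l exactly once, in random order.
    l_shuffled = random.sample(l, len(l))
    return l_shuffled

shuffled = random_shuffle(list(range(10)))
print(len(shuffled))  # → 10
```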
1.10
Now split the shuffled list. Take 80% (4459) of the samples and assign them to a variable called data_train. Put the rest in a variable called data_test.
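A sketch of the split; the synthetic 5574-element list stands in for the shuffled dataset (with 5574 samples, 80% rounds down to the 4459 mentioned above):

```python
# dataset_shuffled stands in for the shuffled list from 1.9.
dataset_shuffled = [('ham', f'msg {i}') for i in range(5574)]

n_train = int(0.8 * len(dataset_shuffled))  # 4459
data_train = dataset_shuffled[:n_train]
data_test = dataset_shuffled[n_train:]
print(len(data_train), len(data_test))  # → 4459 1115
```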
2. Data Modeling
We shall use Naive Bayes for modelling our classifier. You can read about Naive Bayes from
here (https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Multinomial_naive_Bayes). But you
don't actually need to read it, because we are going to move step by step in building this
classifier.
$$\tilde{P}(w_i \mid C) = \frac{\text{Number of occurrences of } w_i \text{ in all samples of class } C + 1}{\text{Total number of words in all samples of class } C + \text{Vocabulary size}}$$
2.1
Find the vocabulary - the list of unique words across all smses of data_train - and assign it to the variable vocab.
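One possible sketch, using a set comprehension to deduplicate; the tiny data_train below is a stand-in for the real training split:

```python
# data_train stands in for the training split of (label, sms) tuples.
data_train = [('ham', 'ok lar joking'), ('spam', 'free entry free')]

# Collect every distinct word across all training smses.
vocab = list({word for _, sms in data_train for word in sms.split()})
print(sorted(vocab))  # → ['entry', 'free', 'joking', 'lar', 'ok']
```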
2.2
For every word wi in vocab, find the count (total number of occurrences) of wi in all smses of type spam. Put these counts in a dictionary named dict_spam, where the key is the word wi and the value is the count.
In a similar way, create a variable called dict_ham which contains the counts of each word in the vocabulary in smses of type ham. (This is only w.r.t. samples in data_train.)
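A sketch of one way to build both dictionaries with a shared helper; the helper name word_counts and the toy data_train are assumptions for illustration:

```python
data_train = [('ham', 'ok lar ok'), ('spam', 'free entry free'), ('ham', 'free ok')]
vocab = list({w for _, sms in data_train for w in sms.split()})

def word_counts(label):
    # Count occurrences of each vocabulary word across all smses
    # carrying the given label; words never seen stay at 0.
    counts = {w: 0 for w in vocab}
    for lab, sms in data_train:
        if lab == label:
            for w in sms.split():
                counts[w] += 1
    return counts

dict_spam = word_counts('spam')
dict_ham = word_counts('ham')
print(dict_spam['free'], dict_ham['free'])  # → 2 1
```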
2.3
For every word wi in vocab, find the smoothed probability $\tilde{P}(w_i \mid \text{spam})$ and put it in a dictionary named dict_prob_spam. In a similar way, define the dictionary dict_prob_ham which contains the smoothed probabilities $\tilde{P}(w_i \mid \text{ham})$.
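A sketch applying the Laplace-smoothed formula from section 2; the helper name smoothed_probs and the toy count dictionaries are assumptions:

```python
vocab = ['free', 'ok', 'entry']
dict_spam = {'free': 2, 'ok': 0, 'entry': 1}
dict_ham = {'free': 1, 'ok': 3, 'entry': 0}

def smoothed_probs(counts):
    # Laplace smoothing: add 1 to each count and normalise by
    # (total words in the class + vocabulary size).
    total = sum(counts.values())
    return {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}

dict_prob_spam = smoothed_probs(dict_spam)
dict_prob_ham = smoothed_probs(dict_ham)
print(dict_prob_spam['free'])  # → 0.5, i.e. (2 + 1) / (3 + 3)
```

With smoothing, every probability is strictly positive and each dictionary's values sum to 1, so an unseen-in-class word never zeroes out a score.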
3. Prediction
We need to test our model on data_test. For each sample of data_test, the prediction procedure is as follows:
- For all words common to the sample and the vocabulary, find spam_score and ham_score.
- If spam_score is higher than ham_score, we predict the sample to be spam, and vice versa.

$$\text{spam\_score} = P(\text{spam}) \cdot \tilde{P}(w_1 \mid \text{spam}) \cdot \tilde{P}(w_2 \mid \text{spam}) \cdots$$

where $w_1, w_2, \ldots$ are the words which occur both in the test sms and in the vocabulary. Similarly,

$$\text{ham\_score} = P(\text{ham}) \cdot \tilde{P}(w_1 \mid \text{ham}) \cdot \tilde{P}(w_2 \mid \text{ham}) \cdots$$

Here

$$P(\text{spam}) = \frac{\text{Number of samples of type spam in training set}}{\text{Total number of samples in training set}}$$

and similarly

$$P(\text{ham}) = \frac{\text{Number of samples of type ham in training set}}{\text{Total number of samples in training set}}$$

(Note: the above is the prediction procedure for a single sample in data_test.)
Write a function predict which does this.
3.1
def predict(text, dict_prob_spam, dict_prob_ham, data_train):
    """Function which predicts the label of the sms
    Inputs:
        text: string, sms
        dict_prob_spam: dictionary, contains dict_prob_spam as defined above
        dict_prob_ham: dictionary, contains dict_prob_ham as defined above
        data_train: list, list of tuples of type (label, sms), contains training dataset
    Outputs:
        prediction: string, one of two strings - either 'spam' or 'ham'
    """
    ### Write your code here
    return prediction
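A sketch following the procedure above; the toy probability dictionaries and two-sample training list are hypothetical stand-ins:

```python
def predict(text, dict_prob_spam, dict_prob_ham, data_train):
    n = len(data_train)
    n_spam = sum(1 for label, _ in data_train if label == 'spam')
    # Start each score from the class prior estimated on data_train.
    spam_score = n_spam / n
    ham_score = (n - n_spam) / n
    # Multiply in the smoothed probability of every word that occurs
    # both in the sms and in the vocabulary (the dictionaries' keys).
    for word in text.split():
        if word in dict_prob_spam:
            spam_score *= dict_prob_spam[word]
            ham_score *= dict_prob_ham[word]
    return 'spam' if spam_score > ham_score else 'ham'

# A toy check with hypothetical smoothed probabilities:
probs_spam = {'free': 0.5, 'ok': 0.1}
probs_ham = {'free': 0.1, 'ok': 0.5}
train = [('spam', 'free'), ('ham', 'ok')]
print(predict('free free', probs_spam, probs_ham, train))  # → spam
```

Multiplying many small probabilities can underflow on long messages; summing logarithms of the same factors is a common alternative that preserves the comparison.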
3.2
Now find the accuracy of the model. Apply the function predict to all the samples in data_test.

$$\text{accuracy} = \frac{\text{number of correct predictions}}{\text{size of test set}}$$

Write the function accuracy which applies predict to all samples in data_test and returns the accuracy.
def accuracy(data_test, dict_prob_spam, dict_prob_ham, data_train):
    """
    Outputs:
        accuracy: float, value of accuracy
    """
    ### Write your code here
    return accuracy
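A sketch of one way to compute this. The signature of accuracy is an assumption (the original header did not survive extraction), and the predict body plus the toy data are stand-ins:

```python
def predict(text, dict_prob_spam, dict_prob_ham, data_train):
    # Assumed predict from 3.1, reproduced so this sketch is self-contained.
    n = len(data_train)
    n_spam = sum(1 for label, _ in data_train if label == 'spam')
    spam_score, ham_score = n_spam / n, (n - n_spam) / n
    for word in text.split():
        if word in dict_prob_spam:
            spam_score *= dict_prob_spam[word]
            ham_score *= dict_prob_ham[word]
    return 'spam' if spam_score > ham_score else 'ham'

def accuracy(data_test, dict_prob_spam, dict_prob_ham, data_train):
    # Fraction of test samples whose predicted label matches the true label.
    correct = sum(1 for label, sms in data_test
                  if predict(sms, dict_prob_spam, dict_prob_ham, data_train) == label)
    return correct / len(data_test)

probs_spam = {'free': 0.5, 'ok': 0.1}
probs_ham = {'free': 0.1, 'ok': 0.5}
train = [('spam', 'free'), ('ham', 'ok')]
test = [('spam', 'free free'), ('ham', 'ok ok')]
print(accuracy(test, probs_spam, probs_ham, train))  # → 1.0
```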