
ASSIGNMENT 3: SVM, Probabilistic Models, Boosting

Full Marks: 100

In this assignment we will cover the following concepts taught in class:
1. Support Vector Machine
2. Naive Bayes Classifier
3. Gradient Boosting

Q1. Support Vector Machine (45)


Problem Statement: Train a Support Vector Machine model to detect fake news articles!

Data Set Description:


Download the dataset here: https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset

The dataset contains four fields: the title of the article, the article text, the subject of the
article, and the date of the article. Download both files, Fake.csv and True.csv, and combine
the two datasets, attaching the appropriate class label to each article.
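
For instance, the two files can be combined as follows (a minimal sketch in Python; the 0/1
label encoding, file locations, and random seed are assumptions, and any consistent labelling
works):

    import pandas as pd

    # Assumes Fake.csv and True.csv sit in the working directory; the
    # columns follow the Kaggle dataset: title, text, subject, date.
    fake = pd.read_csv("Fake.csv")
    true = pd.read_csv("True.csv")
    fake["label"] = 0   # fake news (assumed encoding)
    true["label"] = 1   # real news (assumed encoding)

    data = pd.concat([fake, true], ignore_index=True)
    data = data.sample(frac=1, random_state=42).reset_index(drop=True)  # shuffle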

Assignment Tasks: In this assignment, use the scikit-learn SVM package to classify the
above data set and study the performance of the SVM algorithm. Randomly pick 70% of the
data set as training data and use the remainder as test data. Submit a report in PDF format.
The report should contain the following sections (an illustrative pipeline sketch follows the list):
1. Clean the text with the following preprocessing steps: a) removal of punctuation, b)
removal of stopwords, c) stemming using the NLTK PorterStemmer, d) lemmatization
using the NLTK WordNetLemmatizer, e) removal of URLs, f) removal of HTML tags. (5)
2. Use a count vectorizer to generate the training feature vectors. Create the vocabulary
from the count vectorizer (using the training data). (5)
3. Generate the training feature vectors using the training data and the vocabulary. Train
a linear SVM on the training feature vectors. (5)
4. Generate the test feature vectors using the test data and the above vocabulary. Report
the test accuracy of the trained linear SVM model. (5)
5. Reduce the vocabulary to ⅔ of the original vocabulary and generate train and test
feature vectors from the reduced vocabulary, using the training and test data
respectively. (5)
6. Train another linear SVM model on the new training feature vectors and report the test
accuracy on the new test feature vectors. (10)
7. Effect of data cleaning: using the raw training and test data (before preprocessing),
generate training and test features with the above count vectorizer, train a linear
SVM model on the training feature vectors, and report the test accuracy on the test
feature vectors to analyse the effect of data preprocessing. (10)
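
The sketch below illustrates tasks 1 to 4 only; it assumes `data` is the combined DataFrame
from the earlier sketch, uses NLTK's default English stopword list, and stands in scikit-learn's
LinearSVC for the linear SVM. It is one possible pipeline, not the required solution:

    import re
    import string

    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer, WordNetLemmatizer
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.svm import LinearSVC

    nltk.download("stopwords")
    nltk.download("wordnet")
    stop_words = set(stopwords.words("english"))
    stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()

    def clean(text):
        text = re.sub(r"https?://\S+", " ", text)             # e) remove URLs
        text = re.sub(r"<[^>]+>", " ", text)                  # f) remove HTML tags
        text = text.translate(str.maketrans("", "", string.punctuation))   # a)
        tokens = [w for w in text.lower().split() if w not in stop_words]  # b)
        tokens = [lemmatizer.lemmatize(stemmer.stem(w)) for w in tokens]   # c), d)
        return " ".join(tokens)

    # 70/30 random train-test split on the cleaned article text
    X_train, X_test, y_train, y_test = train_test_split(
        data["text"].apply(clean), data["label"], train_size=0.7, random_state=42)

    vectorizer = CountVectorizer()           # vocabulary from training data only
    X_tr = vectorizer.fit_transform(X_train)
    X_te = vectorizer.transform(X_test)      # reuse the training vocabulary

    svm = LinearSVC()
    svm.fit(X_tr, y_train)
    print("test accuracy:", accuracy_score(y_test, svm.predict(X_te)))

For tasks 5 and 6, the assignment does not prescribe how to reduce the vocabulary; one
option is to rebuild the vectorizer with max_features = 2 * len(vectorizer.vocabulary_) // 3
and retrain. For task 7, fit the same kind of CountVectorizer on the raw (uncleaned) text
instead.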

Submission Guidelines: Name your report file after your roll number (e.g., 18CS72P07_1.pdf).
The submitted report should be in PDF format and carry the following header comments: # Roll
# Name # Assignment number. Also submit the program file, named the same way (e.g.,
18CS72P07_1.py).

Q2. Naive Bayes Classifier (20)
a. Imagine that you are given the following set of training examples. Each feature can
take on one of three nominal values: a, b, or c.

F1  F2  F3  Category
a   c   a   1
c   a   c   1
a   a   c   0
b   c   a   0
c   c   b   0

How would a Naive Bayes system classify the following test example?
F1 = a, F2 = c, F3 = b (10)
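
If you want to sanity-check your hand computation, scikit-learn's CategoricalNB can
reproduce it (a sketch under assumptions: a/b/c are encoded as 0/1/2, and near-zero
smoothing approximates the unsmoothed frequency estimates used by hand):

    import numpy as np
    from sklearn.naive_bayes import CategoricalNB

    enc = {"a": 0, "b": 1, "c": 2}   # assumed nominal-to-integer encoding
    rows = [("a", "c", "a"), ("c", "a", "c"), ("a", "a", "c"),
            ("b", "c", "a"), ("c", "c", "b")]
    X = np.array([[enc[f] for f in r] for r in rows])
    y = np.array([1, 1, 0, 0, 0])

    clf = CategoricalNB(alpha=1e-10)  # near-zero Laplace smoothing
    clf.fit(X, y)
    test = np.array([[enc["a"], enc["c"], enc["b"]]])
    print(clf.predict(test), clf.predict_proba(test))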

b. Consider an example where X1, X2, and X3 are all Boolean features and Y is a Boolean
label. X1 and X2 are truly independent given Y, and X3 is a copy of X2 (meaning that X3
and X2 always have the same value). Suppose you are now given a test example with
X1 = T and X2 = X3 = F. You are also given the probabilities:
P(X1 = T|Y = T) = p
P(X1 = T|Y = F) = 1 − p
P(X2 = F|Y = T) = q
P(X2 = F|Y = F) = 1 − q
P(Y = T) = P(Y = F) = 0.5
Prove that the Naive Bayes decision rule for classifying the test example positively is:
p ≥ (1 − q)² / (q² + (1 − q)²) (10)

Submission Guidelines: Name your report file after your roll number (e.g., 18CS72P07_2.pdf).
The submitted report should be in PDF format and carry the following header comments: # Roll
# Name # Assignment number.

Q3. Gradient Boosting (35)

Problem Statement: Diabetes classification using XGBoost Classifier.


Dataset: Please download the dataset from the link:
https://www.dropbox.com/s/c1q3qix77hclbf2/diabetes.csv?dl=0

The dataset consists of nine columns, including features such as "Glucose" and
"BloodPressure". The target variable is the field named "Outcome", which takes the values
0 or 1. Use an 80-20 train-test split for the experiment.
Assignment Tasks: In this assignment, please use the XGBClassifier available in the
XGBoost Python package to perform classification on the diabetes dataset. Vary the
following parameters and report the test set accuracy for each combination below (a sketch
mapping these settings to the XGBClassifier API follows the list): (35)

1. Learning rate = 0.1, objective = logistic regression
2. Learning rate = 0.1, objective = hinge loss
3. Learning rate = 0.3, objective = logistic regression, max_depth = 2
4. Learning rate = 0.3, objective = logistic regression, max_depth = 8
5. Learning rate = 0.7, objective = logistic regression
6. Learning rate = 0.7, objective = hinge loss
7. Learning rate = 0.7, objective = hinge loss, max_depth = 8
8. Learning rate = 0.3, objective = logistic regression, L1 regularisation = 0.2, max_depth = 8
9. Learning rate = 0.3, objective = logistic regression, L2 regularisation = 0.2, max_depth = 8
10. Learning rate = 0.3, objective = logistic regression, split finding algorithm =
approximate algorithm (described in the original XGBoost paper:
https://arxiv.org/pdf/1603.02754.pdf)
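
As a rough guide, the sketch below shows how one of these combinations (number 8) maps
onto the XGBClassifier API; the file name diabetes.csv and the fixed random_state are
assumptions, and the commented lines mark where the other combinations differ:

    import pandas as pd
    import xgboost as xgb
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("diabetes.csv")
    X, y = df.drop(columns=["Outcome"]), df["Outcome"]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.8,
                                              random_state=42)

    clf = xgb.XGBClassifier(
        learning_rate=0.3,
        objective="binary:logistic",  # "binary:hinge" for the hinge-loss runs
        max_depth=8,
        reg_alpha=0.2,                # L1 regularisation (combination 8)
        # reg_lambda=0.2,             # L2 regularisation (combination 9)
        # tree_method="approx",       # approximate split finding (combination 10)
    )
    clf.fit(X_tr, y_tr)
    print("test accuracy:", accuracy_score(y_te, clf.predict(X_te)))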

Submission Guidelines: Name your report file after your roll number (e.g., 18CS72P07_3.pdf).
The submitted report should be in PDF format and carry the following header comments: # Roll
# Name # Assignment number. Also submit the program file, named the same way (e.g.,
18CS72P07_3.py).
