2023-24 AIML ML Mid-Semester Regular QP Answer-Keys

The document discusses several data quality issues found in a sample dataset and how they can be resolved. It also contains questions regarding classification models, logistic regression, and decision trees. For question 1, it identifies issues like inconsistent data types, missing values, and outliers. It suggests resolving them by data cleaning, imputation, and transformation techniques. Question 2 discusses bias and variance in regression models. Question 3 tests knowledge of logistic regression. Question 4 evaluates a fraud detection model's performance. Question 5 involves building a decision tree classifier to predict literary work nominations.

Uploaded by

SabariMurugan Sivakumar


Q1.

a) You have been given the task of preprocessing data retrieved from multiple sources
before applying any data mining task. Identify at least 5 data quality issues in the
sample data set retrieved from the master data set, and suggest how you would resolve
these quality issues (Python code is not required). [5]

TXN-ID  NAME       AGE  HEIGHT  WEIGHT  BLOOD GROUP  COVID-19 RESULT
T001    RAMA       45   145     62kg    O+ve         Positive
T002    SEETHA     43   168     45kg    B+ve         Negative
T003    Akbar      38   172     60kg    Iam+ve       Positive
T004    BIRBAL     45   168     52kg    AB+ve        Negative
T005    THenali    22   157     78kg    B-ve         1
T006    Venkat     36   157     54kg    O-ve         Negative
T007    Rajuu      350  132     48kg    O+ve         Positive
T008    HARI       32   180     120lbs  AB-ve        Negative
T009    Inba       25           85kg    O+ve         0
T010    SysUsr789  20   165     68kg    O-ve         Negative

The NAME value SysUsr789 (record T010) is alphanumeric and therefore inconsistent with
the other names, which are purely alphabetic. (0.5)
This data quality issue can be resolved by replacing the field with the correct name so
that the attribute is consistent. (0.5)
The AGE 350 in record T007 is an outlier, and the HEIGHT for Inba (T009) is missing. (0.5)
These issues can be resolved by filling in the mean age and mean height; the mean age
should be computed without the outlier (or the median used instead), since 350 would
distort it. (0.5)
There is a unit mismatch in T008: the WEIGHT for HARI is 120lbs, whereas all other
records use kg. (0.5)
This is a unit-consistency issue and can be resolved through data transformation,
converting the value to kg by either a manual or an automatic edit. (0.5)
The BLOOD GROUP in record T003 uses an inconsistent format: Iam+ve is not a valid
blood group. (0.5)
This value can be replaced with NULL or handled by applying binning techniques. (0.5)
The COVID-19 RESULT representation is mismatched: T005 has 1 and T009 has 0 instead of
the Positive and Negative values used elsewhere. (0.5)
This issue can be solved by applying a data transformation such as data smoothing; the
change is simple, as only two values require replacement. (0.5)
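Although the question does not ask for code, the resolutions above can be sketched in a few lines. This is only an illustration on four of the sample records: the age cut-off of 120, the use of the median for imputation, and the mapping of 1/0 to Positive/Negative are assumptions, not part of the answer key.

```python
from statistics import median

# Four rows taken from the sample table; None marks the missing HEIGHT of T009.
records = [
    {"id": "T007", "age": 350, "height": 132, "weight": "48kg", "result": "Positive"},
    {"id": "T008", "age": 32, "height": 180, "weight": "120lbs", "result": "Negative"},
    {"id": "T009", "age": 25, "height": None, "weight": "85kg", "result": "0"},
    {"id": "T005", "age": 22, "height": 157, "weight": "78kg", "result": "1"},
]

LB_TO_KG = 0.45359237
MAX_PLAUSIBLE_AGE = 120                          # assumed cut-off for outliers
RESULT_MAP = {"1": "Positive", "0": "Negative"}  # assumed meaning of the codes


def weight_in_kg(text):
    """Convert a weight string such as '120lbs' or '62kg' to kilograms."""
    if text.endswith("lbs"):
        return round(float(text[:-3]) * LB_TO_KG, 1)
    return float(text[:-2])  # strip the trailing "kg"


def clean(rows):
    """Impute outlier/missing values, unify weight units and result labels."""
    valid_ages = [r["age"] for r in rows if r["age"] <= MAX_PLAUSIBLE_AGE]
    known_heights = [r["height"] for r in rows if r["height"] is not None]
    age_fill, height_fill = median(valid_ages), median(known_heights)
    cleaned = []
    for r in rows:
        cleaned.append({
            "id": r["id"],
            "age": r["age"] if r["age"] <= MAX_PLAUSIBLE_AGE else age_fill,
            "height": r["height"] if r["height"] is not None else height_fill,
            "weight_kg": weight_in_kg(r["weight"]),
            "result": RESULT_MAP.get(r["result"], r["result"]),
        })
    return cleaned


cleaned = clean(records)
for row in cleaned:
    print(row)
```

The median is used here because it is not pulled upward by the outlier; a mean computed after excluding the outlier would serve the same purpose.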

Q1. b) Your friend needs your help. She needs to classify job applications into good/bad
categories, and also to detect job applicants who lie in their applications using density estimation
to detect outliers. To meet these needs, do you recommend using a discriminative or generative
classifier? Why? [2 marks][1 mark each]

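Ans – Generative classifier. Reason – density estimation requires the class-conditional
density P(x|y), which a generative classifier models (together with P(y)); a
discriminative classifier models only P(y|x) and gives no density over x with which to
flag outliers.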
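A minimal sketch of why a generative model serves both needs, assuming a single numeric feature and a Gaussian P(x|y) per class. The data, the labels and the helper names are hypothetical and only illustrate the idea.

```python
import math
from statistics import mean, pstdev


def fit(samples):
    """Estimate P(y) and a 1-D Gaussian P(x|y) for each class from (x, y) pairs."""
    model = {}
    for label in {y for _, y in samples}:
        xs = [x for x, y in samples if y == label]
        model[label] = (len(xs) / len(samples), mean(xs), pstdev(xs))
    return model


def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def density(model, x):
    """Marginal density P(x) = sum over y of P(y) * P(x|y); low values suggest outliers."""
    return sum(prior * gaussian_pdf(x, mu, sigma) for prior, mu, sigma in model.values())


def classify(model, x):
    """Pick the class with the largest P(y) * P(x|y)."""
    return max(model, key=lambda y: model[y][0] * gaussian_pdf(x, model[y][1], model[y][2]))


# Hypothetical application scores with good/bad labels.
data = [(1.0, "bad"), (2.0, "bad"), (3.0, "bad"),
        (7.0, "good"), (8.0, "good"), (9.0, "good")]
model = fit(data)

print(classify(model, 8.5))    # classification from the same fitted model
print(density(model, 8.0))     # typical application: comparatively high density
print(density(model, 30.0))    # far from all training data: density near zero
```

The same fitted quantities drive both the classification and the outlier score, which is what a purely discriminative model cannot offer.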

Q2. a) Suppose you are using Ridge Regression and you notice that the training error
and the validation error are almost equal and fairly high. Would you say that the model
suffers from high bias or high variance? Justify your stance. In such a scenario, what steps
would you take? Should you increase the regularization hyperparameter λ or reduce it?
Why? [3]

The model is likely underfitting the training dataset, which means it has high bias;
high variance would instead show a low training error and a much higher validation
error. [1]
You should try reducing the regularization hyperparameter λ. [1]

Higher values of λ increase bias and reduce variance, while lower values have the
opposite effect. A value that is too small might result in overfitting, while a value
that is too large could lead to underfitting. Cross-validation techniques can be
employed to find the optimal value for λ. Striking the right balance is essential for
achieving a model that generalizes well to unseen data. [1 mark for justification]
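The effect of λ on the fit can be seen in a small sketch. It assumes the simplest possible case, a single feature with no intercept, where the ridge solution has the closed form θ = Σxy / (Σx² + λ); the data points are made up for illustration.

```python
def ridge_theta(xs, ys, lam):
    """Closed-form ridge solution for y ~ theta * x (one feature, no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def mse(xs, ys, theta):
    """Mean squared error of the fit y ~ theta * x."""
    return sum((theta * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up data lying close to y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# Training error for increasing regularization strength.
errors = {lam: mse(xs, ys, ridge_theta(xs, ys, lam)) for lam in (0.0, 10.0, 100.0)}
for lam, err in errors.items():
    print(f"lambda={lam:6.1f}  theta={ridge_theta(xs, ys, lam):.4f}  train MSE={err:.4f}")
```

As λ grows, θ shrinks toward 0 and the training error rises, which is the high-bias regime described above; lowering λ moves the model back toward the unregularized fit.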

Q2. b) Suppose you have been given a large dataset with n=2000000 instances and m(# of
features)=300000 for each instance. You are asked to use multivariate linear regression to fit
the θ parameters to the data. Which approach would you prefer, gradient descent or the method
of least squares, and why? [3]
Gradient descent (1 mark).
The method of least squares (normal equation) is very slow when the number of features is
very large: it requires inverting the m × m matrix XᵀX, which costs roughly O(m^3) with
m = 300000 here. (2 marks for the explanation)
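For comparison, a minimal batch gradient descent sketch for linear regression is shown below. The learning rate, the number of epochs and the tiny dataset are illustrative assumptions; the point is that each update only needs sums over the data and never a matrix inverse.

```python
def gradient_descent(X, y, lr=0.1, epochs=2000):
    """Batch gradient descent for linear regression.

    Each row of X includes a leading 1.0 for the bias term.
    One epoch costs O(n * m) for n instances and m parameters.
    """
    n, m = len(X), len(X[0])
    theta = [0.0] * m
    for _ in range(epochs):
        # Residual h(x) - y for every instance under the current parameters.
        residuals = [sum(t * xj for t, xj in zip(theta, row)) - yi
                     for row, yi in zip(X, y)]
        # Gradient of the mean squared error (with the conventional 1/2 factor).
        grad = [sum(r * row[j] for r, row in zip(residuals, X)) / n for j in range(m)]
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta


# Tiny made-up dataset generated from y = 1 + 2x.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
theta = gradient_descent(X, y)
print(theta)  # approaches [1.0, 2.0]
```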

Q3. Suppose you are building a logistic regression model for the given dataset using the
gradient descent approach. You managed to identify the parameters θ0, θ1 such that
J(θ0,θ1)=0, where J(θ0,θ1) is the cost function. Which of the following statements (a-d)
must be True? Justify your answer in each case.
[6 marks]

a) The model will work perfectly well for the unseen/new instances without any error. It
will predict correct values of the target variable, Y. False. A zero training cost only
shows that the training data are fit perfectly; the model may be overfitted and can still
make errors on unseen instances.

b) If J(θ0,θ1)=0 for some values of θ0 and θ1, then hθ(x(i))=y(i) for every training
example (x(i),y(i)). True. The cost is a sum of non-negative per-example terms, so it can
be 0 only if every term is 0.

c) For J(θ0,θ1) to be 0, θ0 and θ1 must be 0. False, it is not necessary: the cost depends
on how well hθ fits the training data, not on the parameters themselves being 0.

d) J(θ0,θ1) cannot be 0. False. It can be 0, as in the case stated in the question.

[0.5 marks for True/False. 1 mark for justification]
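Statement (c) can be checked numerically with a small sketch. It assumes the usual cross-entropy cost with a sigmoid hypothesis and a made-up, linearly separable two-point dataset: setting both parameters to 0 gives a cost of ln 2, while a non-zero θ1 drives the cost to practically zero.

```python
import math


def logistic_cost(theta0, theta1, xs, ys):
    """Average cross-entropy cost J(theta0, theta1) for one-feature logistic regression."""
    total = 0.0
    for x, y in zip(xs, ys):
        z = theta0 + theta1 * x
        # With h = sigmoid(z): -log(h) = log(1 + e^-z) and -log(1 - h) = log(1 + e^z).
        total += math.log1p(math.exp(-z)) if y == 1 else math.log1p(math.exp(z))
    return total / len(xs)


# Made-up separable data: negative x belongs to class 0, positive x to class 1.
xs, ys = [-1.0, 1.0], [0, 1]

cost_at_zero = logistic_cost(0.0, 0.0, xs, ys)    # every prediction is 0.5
cost_nonzero = logistic_cost(0.0, 50.0, xs, ys)   # predictions match the labels closely

print(cost_at_zero)   # ln 2, about 0.6931
print(cost_nonzero)   # practically zero
```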

2. Explain the importance of feature scaling in learning model parameters, θ in logistic
regression. [2 marks]
When using gradient descent, you should ensure that all features have a similar scale;
otherwise, it will take longer to converge.
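One common way to bring features to a similar scale is standardization (z-scores). The sketch below is a minimal illustration; the AGE and HEIGHT values are borrowed from records T001 to T005 of the Q1 sample table purely as example inputs.

```python
from statistics import mean, pstdev


def standardize(column):
    """Rescale a feature column to zero mean and unit standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    return [(v - mu) / sigma for v in column]


ages = [45, 43, 38, 45, 22]            # AGE of T001-T005
heights = [145, 168, 172, 168, 157]    # HEIGHT of T001-T005

scaled_ages = standardize(ages)
scaled_heights = standardize(heights)

print([round(v, 3) for v in scaled_ages])
print([round(v, 3) for v in scaled_heights])
```

After scaling, both columns vary over a comparable range, so a single learning rate suits every parameter during gradient descent.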

Q4. Suppose we train a model to predict whether a credit card transaction is Fraudulent or Not.
After training the model, we apply it to a test set of 200 new transactions (also labelled) and the
model produces the contingency table below.

                          Predicted Class
                          Fraud    Not Fraud
True Class   Fraud        60       0
             Not Fraud    120      20

List your crisp point-wise observations on the classifier with supporting justification.
(4 marks)
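The standard metrics that such observations usually draw on can be computed directly from the table. The sketch below assumes Fraud is treated as the positive class; it is a worked calculation, not the official answer key.

```python
# Counts read from the contingency table, with Fraud as the positive class.
tp, fn = 60, 0      # true Fraud row: predicted Fraud / predicted Not Fraud
fp, tn = 120, 20    # true Not Fraud row: predicted Fraud / predicted Not Fraud

total = tp + fn + fp + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)        # share of predicted frauds that are real frauds
recall = tp / (tp + fn)           # share of real frauds that are caught
specificity = tn / (tn + fp)      # share of legitimate transactions left alone
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy    = {accuracy:.4f}")
print(f"precision   = {precision:.4f}")
print(f"recall      = {recall:.4f}")
print(f"specificity = {specificity:.4f}")
print(f"F1 score    = {f1:.4f}")
```

Under that assumption the classifier catches every fraud (recall 1.0) but flags most legitimate transactions as well, giving 40% accuracy and a precision of about 0.33.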

Q5. a) Use the ID3 decision tree algorithm to train the classifier, and find which of the
features among {Readership Base, Writer's Reputation spread in other countries} is best
suited for the "root node" in the tree construction. Pictorially represent the complete
resultant decision tree. Show all the calculations and round the values to four decimal
places as appropriate. [4 Marks]
Use case: A committee of experts convenes every year to nominate literary works to become
eligible for the highest category of award by assessing the works on multiple parameters.
Below is one subset of such features. Categorizing the works provides a transparent and
streamlined nomination process. Quantified values of attributes are discretized in the data
below. Build a machine learning model to classify whether an original literary work of a
writer has "High", "Medium" or "Low" chances of nomination by the committee.

b) Justify the below statement with any plagiarism-free example. [2 marks]
"Assessing the model performance of built decision tree classifier using only the training data
set is detrimental to the process."
----------------------------------------------------------------------------------------------------------------
a) (Both of the answer keys below must be accepted by the evaluators)
Answer Key-1 (if log base "2" is used by students):
Class Entropy : 1.5
Entropy of feature 1 "Readership Base": 1.1887, Gain : 0.3113
Entropy of feature 2 "Writer's Reputation...": 0.6667, Gain : 0.8333
Inference : "Writer's Reputation spread..." has the minimum entropy, i.e. the maximum
information gain, and hence it is the selected root for building the decision tree.
Answer Key-2 (if log base "3" is used by students, for k=3 distinct classes, for
normalization):
Class Entropy : 0.9464
Entropy of feature 1 "Readership Base": 0.75, Gain : 0.1964
Entropy of feature 2 "Writer's Reputation...": 0.4206, Gain : 0.5258
Inference : "Writer's Reputation spread..." has the minimum entropy, i.e. the maximum
information gain, and hence it is the selected root for building the decision tree.

Marking Scheme:
1 mark: Entropy of class attribute
1 mark: Entropy or Information Gain of the 1st feature
1 mark: Entropy or Information Gain of the 2nd feature
0.5 mark: Correct choice of the root
0.5 mark: Final decision tree with leaves (labelled by majority voting) at depth 1, built with
the chosen root.
Partial Marking: If none of the above is correct but the student has correctly attempted to
implement the algorithm: 1.5 marks
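The quantities in the answer key are entropy and information gain, and a small sketch shows how they are computed. The four rows, the feature name and its values below are hypothetical, since the exam's data table is not reproduced here; the 2:1:1 class split was picked only because it gives a class entropy of 1.5 bits.

```python
import math
from collections import Counter


def entropy(labels, base=2):
    """Shannon entropy of a list of class labels, in the given log base."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n, base) for c in Counter(labels).values())


def information_gain(rows, feature, target, base=2):
    """Class entropy minus the weighted entropy after splitting on `feature`."""
    gain = entropy([r[target] for r in rows], base)
    for value in {r[feature] for r in rows}:
        subset = [r[target] for r in rows if r[feature] == value]
        gain -= len(subset) / len(rows) * entropy(subset, base)
    return gain


# Hypothetical rows, NOT the exam data.
rows = [
    {"reputation": "Wide", "chance": "High"},
    {"reputation": "Wide", "chance": "High"},
    {"reputation": "Narrow", "chance": "Medium"},
    {"reputation": "Narrow", "chance": "Low"},
]

labels = [r["chance"] for r in rows]
class_entropy = entropy(labels)                      # log base 2
class_entropy_base3 = entropy(labels, base=3)        # log base 3
gain = information_gain(rows, "reputation", "chance")

print(round(class_entropy, 4), round(class_entropy_base3, 4), round(gain, 4))
```

ID3 evaluates this gain for every candidate feature and places the one with the largest gain at the root. Switching the log base divides every entropy by log2(3), so both answer keys select the same feature.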

b) Answer Key:
A generic reason is applicable here. A fully grown decision tree (DT) with complex rules is
prone to learning every pattern in the training data. In the best case, the training
accuracy of the DT will most likely be 100%, which is not a good criterion for measuring
performance. Unseen test/validation data is best for evaluation.
Marking Scheme:
1 mark: Correct reason
1 mark: Any sample dataset split showing that the training and test (or validation) sets may
have different data distributions, for illustration
Partial Marking: -None-
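The reason above can be illustrated with a deliberately extreme sketch. The "model" below simply memorizes its training rows, standing in for a fully grown tree with one leaf per training example; the data and the fallback rule (predict the most common training label) are hypothetical.

```python
from collections import Counter


def fit_memorizer(train):
    """Return a predictor that looks up memorized inputs, else the majority label."""
    table = dict(train)
    fallback = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda x: table.get(x, fallback)


def accuracy(predict, data):
    """Fraction of (x, y) pairs for which the prediction equals y."""
    return sum(predict(x) == y for x, y in data) / len(data)


# Hypothetical split: the test inputs never appear in the training set.
train = [(1, "High"), (2, "Low"), (3, "High"), (4, "Medium")]
test = [(5, "Low"), (6, "High"), (7, "Medium")]

predict = fit_memorizer(train)
train_acc = accuracy(predict, train)
test_acc = accuracy(predict, test)

print(f"training accuracy = {train_acc:.2f}")
print(f"test accuracy     = {test_acc:.2f}")
```

The training accuracy is perfect while the test accuracy is far lower, so a score measured only on the training data says little about how the classifier behaves on new works.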
