DM_Practice_Problem_Set-2

The document contains a set of practice problems for a course on Data Mining and Predictive Modeling. It covers various topics including attribute classification, encoding techniques, Principal Component Analysis, Information Gain, rule-based reasoning, and evaluation metrics for predictive models. Each problem requires analysis and application of data mining concepts to datasets and scenarios.


Practice Problems Set - 2

Data Mining and Predictive Modeling - CSET 228


AY 2024-25, Even Semester

1 Answer the following questions on the types of attributes in Table 1.

Table 1 presents a dataset with four attribute types: Nominal, Binary, Ordinal, and Numeric; each column header is labeled with its type.
ID Name Gender (Nominal) Smoker (Binary) Education Level (Ordinal) Age (Numeric) Salary (Numeric)
1 Alice Female Yes Bachelor’s 28 50,000
2 Bob Male No Master’s 35 75,000
3 Charlie Non-binary Yes High School 22 30,000
4 David Male No PhD 40 90,000
5 Emma Female No Bachelor’s 26 55,000

Table 1: Dataset with Nominal, Binary, Ordinal, and Numeric Attributes

1. Which attributes in the table are Nominal?


2. Identify the Binary attributes and explain why they are classified as binary.
3. Which attribute is Ordinal, and how is it different from Nominal attributes?
4. List all Numeric attributes and explain how they are different from other types.
5. Why is Education Level considered Ordinal instead of Nominal?
6. If we want to use this dataset for a regression model predicting Salary, which attribute types
should be converted into numerical form?

2 A dataset contains the following attributes:


• Age (years)
• Height (cm)
• Blood Type (A, B, AB, O)
• Has Diabetes (Yes/No)
• Education Level (High School, Bachelor’s, Master’s, PhD)
Classify each attribute as Nominal, Ordinal, Binary, or Numeric and justify your classification.

3 A dataset of 1000 employees includes:


• Salary (in dollars)
• Job Title (Manager, Engineer, Clerk, etc.)
• Years of Experience
• Performance Rating (1 to 5)
• Gender (Male/Female)
1. Which attributes are qualitative (categorical) and which are quantitative (numeric)?
2. Convert the Job Title into a numerical representation suitable for machine learning.
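Since Job Title has no natural order, one-hot encoding is a common choice for part 2. A minimal sketch in Python; the sample titles below are a hypothetical subset, since the full 1000-row dataset is not given:

```python
# Hypothetical sample of Job Title values (full data not provided in the problem).
titles = ["Manager", "Engineer", "Clerk", "Engineer"]

# One-hot encoding: one indicator column per distinct title, in sorted order.
vocab = sorted(set(titles))  # ['Clerk', 'Engineer', 'Manager']
one_hot = [[1 if t == v else 0 for v in vocab] for t in titles]
print(one_hot[0])  # "Manager" -> [0, 0, 1]
```

Performance Rating, by contrast, is already ordinal and numeric, so it can be used as-is.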

4 Why do we often use one-hot encoding for Nominal attributes
but integer encoding for Ordinal attributes?
5 What is the key difference between Binary and Nominal attributes? Can a binary attribute also be a nominal attribute? Explain with an example.
6 If a dataset contains:
• Customer Age
• Product Category

• Purchase Amount
• Membership Status (Gold, Silver, Bronze)

1. Identify the data type of each attribute.


2. If we were to predict Purchase Amount, which attributes should be transformed, and how?

7 In a healthcare dataset, the Blood Pressure attribute is recorded as (Low, Normal, High).
1. Is this a Nominal or Ordinal attribute? Justify your answer.
2. If we want to use this data in a machine learning model, how should we encode it?

8 Dataset Example for Attribute Identification

ID Name Eye Color (Nominal) Income ($) (Numeric) Likes Spicy Food (Binary) Education Level (Ordinal)
1 John Brown 50,000 Yes Master’s
2 Emma Blue 70,000 No High School
3 Alex Green 40,000 Yes PhD

Table 2: Dataset with Different Attribute Types

1. Identify each attribute type (Nominal, Ordinal, Binary, or Numeric).


2. Suggest the best encoding techniques for data mining.

9 Explain the concept of vectorization in data mining. Why is it necessary for text processing?
10 Given a corpus with the following three sentences:
• “Machine learning is fun”
• “Deep learning is powerful”

• “Machine learning is powerful”

Construct a Bag of Words representation for the corpus.
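The Bag of Words construction can be sketched directly in Python. The vocabulary ordering is a choice; alphabetical order over lowercased tokens is assumed here:

```python
corpus = [
    "Machine learning is fun",
    "Deep learning is powerful",
    "Machine learning is powerful",
]

# Alphabetical vocabulary over lowercased tokens.
vocab = sorted({w for s in corpus for w in s.lower().split()})
# Each row counts how often each vocabulary word occurs in that sentence.
bow = [[s.lower().split().count(w) for w in vocab] for s in corpus]
print(vocab)   # ['deep', 'fun', 'is', 'learning', 'machine', 'powerful']
print(bow[0])  # [0, 1, 1, 1, 1, 0]
```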

11 Convert the sentence “Data Science is awesome” into a one-hot encoded representation using the vocabulary {“Data”, “Science”, “is”, “awesome”, “Machine”, “learning”}.
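With the given six-word vocabulary, each word of the sentence maps to a 6-dimensional indicator vector; a minimal sketch:

```python
vocab = ["Data", "Science", "is", "awesome", "Machine", "learning"]
sentence = "Data Science is awesome".split()

# One row per word: 1 at that word's vocabulary index, 0 elsewhere.
encoded = [[1 if v == w else 0 for v in vocab] for w in sentence]
print(encoded[0])  # "Data" -> [1, 0, 0, 0, 0, 0]
```

Note that “Machine” and “learning” contribute all-zero positions here because they do not appear in the sentence.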

12 You are given the following dataset with three features:

X = \begin{bmatrix}
2 & 3 & 1 \\
3 & 4 & 2 \\
4 & 5 & 3 \\
5 & 6 & 4
\end{bmatrix}
Perform Principal Component Analysis (PCA) to reduce the dataset from 3D to 1D by following
these steps:

1. Compute the mean of each feature and center the dataset.


2. Compute the covariance matrix of the centered dataset.
3. Find the eigenvalues and eigenvectors of the covariance matrix.
4. Select the principal component with the highest eigenvalue.
5. Project the original 3D data onto the first principal component to obtain a 1D representation.
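The five steps above can be carried out with NumPy; note the sign of the principal component may differ between implementations, so only its direction is meaningful:

```python
import numpy as np

X = np.array([[2, 3, 1],
              [3, 4, 2],
              [4, 5, 3],
              [5, 6, 4]], dtype=float)

Xc = X - X.mean(axis=0)            # 1. center each feature
C = np.cov(Xc, rowvar=False)       # 2. covariance matrix (3x3)
vals, vecs = np.linalg.eigh(C)     # 3. eigenvalues and eigenvectors
pc1 = vecs[:, np.argmax(vals)]     # 4. component with the largest eigenvalue
Z = Xc @ pc1                       # 5. project onto PC1 -> 1D representation
```

In this particular dataset all three features increase in lockstep, so the top eigenvalue captures all of the variance and the 1D projection loses nothing.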

13 You are given a grayscale image represented as a 5 × 5 matrix. Apply Principal Component Analysis (PCA) to compress the image by reducing its dimensionality from 5D to 2D.

The 5 × 5 grayscale image matrix is:

X = \begin{bmatrix}
255 & 200 & 180 & 170 & 160 \\
200 & 180 & 170 & 160 & 150 \\
180 & 170 & 160 & 150 & 140 \\
170 & 160 & 150 & 140 & 130 \\
160 & 150 & 140 & 130 & 120
\end{bmatrix}

14 What is Information Gain (IG), and how is it used in feature selection in the process of developing predictive models? Explain with an example.
15 Consider the following dataset with 5 attributes and a binary target variable Buy Product (Yes/No). Using Information Gain, determine the best feature for classification.

1. Compute the entropy of the target variable Buy Product.


The entropy formula is given by:

H(Y) = -\sum_{i=1}^{n} p(y_i) \log_2 p(y_i)

Person Age Income Student Credit Score Buy Product
1 Young High No Fair No
2 Young High No Excellent No
3 Middle High No Fair Yes
4 Senior Medium No Fair Yes
5 Senior Low Yes Fair Yes
6 Senior Low Yes Excellent No
7 Middle Low Yes Excellent Yes
8 Young Medium No Fair No
9 Young Low Yes Fair Yes
10 Senior Medium Yes Fair Yes
11 Young Medium Yes Excellent Yes
12 Middle Medium No Excellent Yes
13 Middle High Yes Fair Yes
14 Senior Medium No Excellent No

Table 3: Dataset for Information Gain Calculation

2. Calculate the Information Gain for each feature: Age, Income, Student, and Credit Score.
The Information Gain is given by:

IG(X) = H(Y ) − H(Y |X)

where H(Y |X) is the conditional entropy.


3. Identify the best feature for splitting based on the highest Information Gain.
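The entropy and Information Gain computations for Table 3 can be sketched in plain Python (the standard ID3-style calculation):

```python
from math import log2
from collections import Counter

# Rows of Table 3: (Age, Income, Student, Credit Score, Buy Product)
data = [
    ("Young",  "High",   "No",  "Fair",      "No"),
    ("Young",  "High",   "No",  "Excellent", "No"),
    ("Middle", "High",   "No",  "Fair",      "Yes"),
    ("Senior", "Medium", "No",  "Fair",      "Yes"),
    ("Senior", "Low",    "Yes", "Fair",      "Yes"),
    ("Senior", "Low",    "Yes", "Excellent", "No"),
    ("Middle", "Low",    "Yes", "Excellent", "Yes"),
    ("Young",  "Medium", "No",  "Fair",      "No"),
    ("Young",  "Low",    "Yes", "Fair",      "Yes"),
    ("Senior", "Medium", "Yes", "Fair",      "Yes"),
    ("Young",  "Medium", "Yes", "Excellent", "Yes"),
    ("Middle", "Medium", "No",  "Excellent", "Yes"),
    ("Middle", "High",   "Yes", "Fair",      "Yes"),
    ("Senior", "Medium", "No",  "Excellent", "No"),
]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

H = entropy([row[-1] for row in data])   # entropy of Buy Product (9 Yes, 5 No)

def info_gain(col):
    # IG(X) = H(Y) - H(Y|X), with H(Y|X) the weighted entropy of each subset.
    n = len(data)
    cond = 0.0
    for v in {row[col] for row in data}:
        sub = [row[-1] for row in data if row[col] == v]
        cond += len(sub) / n * entropy(sub)
    return H - cond

features = ["Age", "Income", "Student", "Credit Score"]
gains = {f: info_gain(i) for i, f in enumerate(features)}
```

The feature with the largest gain in `gains` is the best split for the root of the tree.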

16 A rule in rule-based reasoning is often written in the form: IF condition(s) THEN action.

Explain the components of such a rule and provide an example of how rules of this form are used in data mining for classification.

17 Consider the following dataset with 5 attributes and a binary target variable (Buy Product: Yes/No). Use Rule-Based Reasoning to extract decision rules.

ID Age Income Student Credit Score Buy Product


1 Young High No Fair No
2 Middle Low Yes Excellent Yes
3 Senior Medium No Fair Yes
4 Young Medium Yes Excellent Yes
5 Senior Low No Fair No

Table 4: Simple Table for Rule-Based Reasoning

1. Identify patterns in the dataset to derive classification rules.


2. Express these rules in the form:

IF condition(s) THEN decision

3. Extract at least two IF-THEN rules for predicting Buy Product.

18 Given the following rule-based classification system, classify the new records based on the provided rules.
• Rule 1: IF Age = Young AND Student = Yes THEN Buy Product = Yes
• Rule 2: IF Credit Score = Fair AND Income = High THEN Buy Product = No
• Rule 3: IF Age = Senior AND Credit Score = Fair THEN Buy Product = No
Classify whether the person will Buy Product (Yes/No) for the following new records:

ID Age Income Student Credit Score Buy Product (Yes/No)?


1 Young Medium Yes Fair ?
2 Senior Low No Excellent ?
3 Middle High Yes Fair ?

Table 5: New Records for Classification

1. Apply the given rules to classify each record.


2. Fill in the Buy Product column with “Yes” or “No” based on the matching rule.
3. If no rule directly applies, justify your classification decision.

19 Consider a transaction dataset from an online store. Using support and confidence, generate two association rules with a minimum support of 50% and minimum confidence of 70%.

Transaction ID Items Bought


1 Milk, Bread, Butter
2 Milk, Butter
3 Bread, Butter
4 Milk, Bread
5 Milk, Bread, Butter

Table 6: Transaction Data for Apriori Algorithm

1. Compute the support for individual items and frequent itemsets.


2. Identify frequent itemsets that satisfy the minimum support threshold of 50%.
3. Calculate the confidence for candidate association rules.
4. Generate two strong association rules that meet both support and confidence thresholds.
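The support and confidence calculations over Table 6 can be checked with a short Python sketch:

```python
from itertools import combinations

transactions = [
    {"Milk", "Bread", "Butter"},
    {"Milk", "Butter"},
    {"Bread", "Butter"},
    {"Milk", "Bread"},
    {"Milk", "Bread", "Butter"},
]

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Enumerate all candidate itemsets and keep those meeting min support 50%.
items = sorted(set().union(*transactions))
frequent = [set(c) for r in (1, 2, 3)
            for c in combinations(items, r) if support(set(c)) >= 0.5]

# Confidence of X -> Y is support(X union Y) / support(X),
# e.g. Milk -> Bread: 0.6 / 0.8 = 0.75, which clears the 70% threshold.
conf_milk_bread = support({"Milk", "Bread"}) / support({"Milk"})
```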

20 Given two vectors in a 5-dimensional space:

A = (1, 3, 7, 5, 9), B = (4, 7, 2, 8, 6)
Compute the Minkowski Distance for:
1. p = 1.5
2. p = 3
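The Minkowski distance generalizes the Manhattan (p = 1) and Euclidean (p = 2) distances; a direct implementation of its definition:

```python
def minkowski(a, b, p):
    # d(a, b) = (sum_i |a_i - b_i|^p)^(1/p)
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

A = (1, 3, 7, 5, 9)
B = (4, 7, 2, 8, 6)

d15 = minkowski(A, B, 1.5)
d3 = minkowski(A, B, 3)   # = (3^3 + 4^3 + 5^3 + 3^3 + 3^3)^(1/3) = 270^(1/3)
```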

21 The table below represents sales performance ($1000s) and
marketing expenses ($1000s) over six months:

Month Sales ($1000s) Marketing Spend ($1000s)


Jan 50 8
Feb 60 12
Mar 55 10
Apr 70 18
May 65 15
Jun 80 20

Table 7: Sales vs Marketing Spend Data

1. Compute the Pearson Correlation Coefficient r between sales and marketing spend.

2. Interpret whether marketing spend has a linear impact on sales.
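The coefficient can be computed directly from its definition using the six monthly pairs:

```python
sales = [50, 60, 55, 70, 65, 80]
spend = [8, 12, 10, 18, 15, 20]

n = len(sales)
mx, my = sum(sales) / n, sum(spend) / n
# r = sum((x - mx)(y - my)) / sqrt(sum((x - mx)^2) * sum((y - my)^2))
num = sum((x - mx) * (y - my) for x, y in zip(sales, spend))
den = (sum((x - mx) ** 2 for x in sales)
       * sum((y - my) ** 2 for y in spend)) ** 0.5
r = num / den
```

A value of r close to +1 would indicate a strong positive linear relationship between marketing spend and sales.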

22 Two documents contain the following sets of unique words:


D1 = {AI, Machine, Learning, Data, Science, Algorithm, Model}
D2 = {AI, Data, Science, Deep, Neural, Learning, Network}

1. Compute the Jaccard Similarity between these documents.


2. If a threshold of 0.5 is required for similarity, determine if these documents are similar.
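Jaccard similarity is the ratio of shared words to total distinct words, which Python set operations compute directly:

```python
D1 = {"AI", "Machine", "Learning", "Data", "Science", "Algorithm", "Model"}
D2 = {"AI", "Data", "Science", "Deep", "Neural", "Learning", "Network"}

# |intersection| / |union|: 4 shared words out of 10 distinct words overall.
jaccard = len(D1 & D2) / len(D1 | D2)
print(jaccard)  # 0.4
```

Since 0.4 falls below the 0.5 threshold, the documents would not be judged similar under that criterion.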

23 In a medical image segmentation task, two binary masks represent detected regions of tumors:
M1 = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]
M2 = [1, 0, 1, 1, 1, 1, 0, 1, 1, 0]

1. Compute the Dice Coefficient to measure the similarity of segmentation.

2. If a Dice score above 0.75 is considered a good segmentation, does the model perform well?
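The Dice coefficient, 2|A ∩ B| / (|A| + |B|), can be computed element-wise over the two masks:

```python
M1 = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]
M2 = [1, 0, 1, 1, 1, 1, 0, 1, 1, 0]

overlap = sum(a & b for a, b in zip(M1, M2))   # pixels set in both masks
dice = 2 * overlap / (sum(M1) + sum(M2))       # Dice = 2|A n B| / (|A| + |B|)
```

With 4 overlapping pixels and 7 foreground pixels in each mask, dice = 8/14 ≈ 0.571, below the 0.75 bar for good segmentation.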

24 Two probability distributions represent word frequencies in two different language models:
P = {0.15, 0.35, 0.25, 0.10, 0.15}, Q = {0.20, 0.30, 0.30, 0.10, 0.10}

1. Compute the Kullback-Leibler (KL) Divergence DKL (P ||Q).


2. Explain whether the two distributions are significantly different.
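The KL divergence can be computed term by term; log base 2 is assumed here, giving the result in bits:

```python
from math import log2

P = [0.15, 0.35, 0.25, 0.10, 0.15]
Q = [0.20, 0.30, 0.30, 0.10, 0.10]

# D_KL(P || Q) = sum_i p_i * log2(p_i / q_i)
d_kl = sum(p * log2(p / q) for p, q in zip(P, Q))
```

Note that D_KL is always non-negative and is not symmetric: D_KL(P || Q) generally differs from D_KL(Q || P).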

25 A new predictive model for tuberculosis detection was applied to 300 patients. The results are as follows:
• 120 patients actually had tuberculosis.
• Out of those 120, the model correctly identified 90 as positive.
• The model incorrectly classified 30 tuberculosis patients as negative (false negatives).
• Out of 180 healthy patients, the model correctly classified 150 as negative.

• The model incorrectly classified 30 healthy patients as positive (false positives).

1. Construct the Confusion Matrix.


2. Compute:
• Accuracy
• Precision
• Recall
• Specificity
• F1-score
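From the counts given, TP = 90, FN = 30, TN = 150, FP = 30, and the metrics follow directly from their definitions:

```python
TP, FN, TN, FP = 90, 30, 150, 30   # counts stated in the problem

accuracy    = (TP + TN) / (TP + TN + FP + FN)        # 240 / 300
precision   = TP / (TP + FP)                         # 90 / 120
recall      = TP / (TP + FN)                         # 90 / 120 (sensitivity)
specificity = TN / (TN + FP)                         # 150 / 180
f1          = 2 * precision * recall / (precision + recall)
```

Because precision and recall are equal here, the F1-score (their harmonic mean) equals both.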
