DM_Practice_Problem_Set-2

The document contains a set of practice problems for a course on Data Mining and Predictive Modeling. It covers various topics including attribute classification, encoding techniques, Principal Component Analysis, Information Gain, rule-based reasoning, and evaluation metrics for predictive models. Each problem requires analysis and application of data mining concepts to datasets and scenarios.


Practice Problems Set - 2

Data Mining and Predictive Modeling - CSET 228


AY 2024-25, Even Semester

1 Answer the following questions on the types of attributes in Table 1.

Table 1 presents a dataset with four attribute types: Nominal, Binary, Ordinal, and Numeric; each column header is labeled with its type.
ID Name Gender (Nominal) Smoker (Binary) Education Level (Ordinal) Age (Numeric) Salary (Numeric)
1 Alice Female Yes Bachelor’s 28 50,000
2 Bob Male No Master’s 35 75,000
3 Charlie Non-binary Yes High School 22 30,000
4 David Male No PhD 40 90,000
5 Emma Female No Bachelor’s 26 55,000

Table 1: Dataset with Nominal, Binary, Ordinal, and Numeric Attributes

1. Which attributes in the table are Nominal?


2. Identify the Binary attributes and explain why they are classified as binary.
3. Which attribute is Ordinal, and how is it different from Nominal attributes?
4. List all Numeric attributes and explain how they are different from other types.
5. Why is Education Level considered Ordinal instead of Nominal?
6. If we want to use this dataset for a regression model predicting Salary, which attribute types
should be converted into numerical form?

2 A dataset contains the following attributes:


• Age (years)
• Height (cm)
• Blood Type (A, B, AB, O)
• Has Diabetes (Yes/No)
• Education Level (High School, Bachelor’s, Master’s, PhD)
Classify each attribute as Nominal, Ordinal, Binary, or Numeric and justify your classification.

3 A dataset of 1000 employees includes:


• Salary (in dollars)
• Job Title (Manager, Engineer, Clerk, etc.)
• Years of Experience
• Performance Rating (1 to 5)
• Gender (Male/Female)
1. Which attributes are qualitative (categorical) and which are quantitative (numeric)?
2. Convert the Job Title into a numerical representation suitable for machine learning.
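Since Job Title has no natural order, one-hot encoding is a common choice for part 2. A minimal sketch in Python; the sample titles below are a hypothetical subset, since the full 1000-row dataset is not given:

```python
# Hypothetical sample of Job Title values (full data not provided in the problem).
titles = ["Manager", "Engineer", "Clerk", "Engineer"]

# One-hot encoding: one indicator column per distinct title, in sorted order.
vocab = sorted(set(titles))  # ['Clerk', 'Engineer', 'Manager']
one_hot = [[1 if t == v else 0 for v in vocab] for t in titles]
print(one_hot[0])  # "Manager" -> [0, 0, 1]
```

Performance Rating, by contrast, is already ordinal and numeric, so it can be used as-is.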

4 Why do we often use one-hot encoding for Nominal attributes
but integer encoding for Ordinal attributes?
5 What is the key difference between Binary and Nominal attributes? Can a binary attribute also be a nominal attribute? Explain with an example.
6 If a dataset contains:
• Customer Age
• Product Category

• Purchase Amount
• Membership Status (Gold, Silver, Bronze)

1. Identify the data type of each attribute.


2. If we were to predict Purchase Amount, which attributes should be transformed, and how?

7 In a healthcare dataset, the Blood Pressure attribute is recorded as (Low, Normal, High).
1. Is this a Nominal or Ordinal attribute? Justify your answer.
2. If we want to use this data in a machine learning model, how should we encode it?

8 Dataset Example for Attribute Identification

ID Name Eye Color (Nominal) Income ($) (Numeric) Likes Spicy Food (Binary) Education Level (Ordinal)
1 John Brown 50,000 Yes Master’s
2 Emma Blue 70,000 No High School
3 Alex Green 40,000 Yes PhD

Table 2: Dataset with Different Attribute Types

1. Identify each attribute type (Nominal, Ordinal, Binary, or Numeric).


2. Suggest the best encoding techniques for data mining.

9 Explain the concept of vectorization in data mining. Why is it necessary for text processing?
10 Given a corpus with the following three sentences:
• “Machine learning is fun”
• “Deep learning is powerful”

• “Machine learning is powerful”

Construct a Bag of Words representation for the corpus.
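The Bag of Words construction can be sketched directly in Python. The vocabulary ordering is a choice; alphabetical order over lowercased tokens is assumed here:

```python
corpus = [
    "Machine learning is fun",
    "Deep learning is powerful",
    "Machine learning is powerful",
]

# Alphabetical vocabulary over lowercased tokens.
vocab = sorted({w for s in corpus for w in s.lower().split()})
# Each row counts how often each vocabulary word occurs in that sentence.
bow = [[s.lower().split().count(w) for w in vocab] for s in corpus]
print(vocab)   # ['deep', 'fun', 'is', 'learning', 'machine', 'powerful']
print(bow[0])  # [0, 1, 1, 1, 1, 0]
```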

11 Convert the sentence “Data Science is awesome” into a one-hot encoded representation using the vocabulary {“Data”, “Science”, “is”, “awesome”, “Machine”, “learning”}.
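With the given six-word vocabulary, each word of the sentence maps to a 6-dimensional indicator vector; a minimal sketch:

```python
vocab = ["Data", "Science", "is", "awesome", "Machine", "learning"]
sentence = "Data Science is awesome".split()

# One row per word: 1 at that word's vocabulary index, 0 elsewhere.
encoded = [[1 if v == w else 0 for v in vocab] for w in sentence]
print(encoded[0])  # "Data" -> [1, 0, 0, 0, 0, 0]
```

Note that “Machine” and “learning” contribute all-zero positions here because they do not appear in the sentence.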

12 You are given the following dataset with three features:

X = \begin{bmatrix}
2 & 3 & 1 \\
3 & 4 & 2 \\
4 & 5 & 3 \\
5 & 6 & 4
\end{bmatrix}
Perform Principal Component Analysis (PCA) to reduce the dataset from 3D to 1D by following
these steps:

1. Compute the mean of each feature and center the dataset.


2. Compute the covariance matrix of the centered dataset.
3. Find the eigenvalues and eigenvectors of the covariance matrix.
4. Select the principal component with the highest eigenvalue.
5. Project the original 3D data onto the first principal component to obtain a 1D representation.
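The five steps above can be carried out with NumPy; note the sign of the principal component may differ between implementations, so only its direction is meaningful:

```python
import numpy as np

X = np.array([[2, 3, 1],
              [3, 4, 2],
              [4, 5, 3],
              [5, 6, 4]], dtype=float)

Xc = X - X.mean(axis=0)            # 1. center each feature
C = np.cov(Xc, rowvar=False)       # 2. covariance matrix (3x3)
vals, vecs = np.linalg.eigh(C)     # 3. eigenvalues and eigenvectors
pc1 = vecs[:, np.argmax(vals)]     # 4. component with the largest eigenvalue
Z = Xc @ pc1                       # 5. project onto PC1 -> 1D representation
```

In this particular dataset all three features increase in lockstep, so the top eigenvalue captures all of the variance and the 1D projection loses nothing.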

13 You are given a grayscale image represented as a 5 × 5 matrix. Apply Principal Component Analysis (PCA) to compress the image by reducing its dimensionality from 5D to 2D.

The 5 × 5 grayscale image matrix is:

X = \begin{bmatrix}
255 & 200 & 180 & 170 & 160 \\
200 & 180 & 170 & 160 & 150 \\
180 & 170 & 160 & 150 & 140 \\
170 & 160 & 150 & 140 & 130 \\
160 & 150 & 140 & 130 & 120
\end{bmatrix}

14 What is Information Gain (IG), and how is it used in feature selection in the process of developing predictive models? Explain with an example.
15 Consider the following dataset with 5 attributes and a binary target variable Buy Product (Yes/No). Using Information Gain, determine the best feature for classification.

1. Compute the entropy of the target variable Buy Product.


The entropy formula is given by:

H(Y) = -\sum_{i=1}^{n} p(y_i) \log_2 p(y_i)

Person Age Income Student Credit Score Buy Product
1 Young High No Fair No
2 Young High No Excellent No
3 Middle High No Fair Yes
4 Senior Medium No Fair Yes
5 Senior Low Yes Fair Yes
6 Senior Low Yes Excellent No
7 Middle Low Yes Excellent Yes
8 Young Medium No Fair No
9 Young Low Yes Fair Yes
10 Senior Medium Yes Fair Yes
11 Young Medium Yes Excellent Yes
12 Middle Medium No Excellent Yes
13 Middle High Yes Fair Yes
14 Senior Medium No Excellent No

Table 3: Dataset for Information Gain Calculation

2. Calculate the Information Gain for each feature: Age, Income, Student, and Credit Score.
The Information Gain is given by:

IG(X) = H(Y ) − H(Y |X)

where H(Y |X) is the conditional entropy.


3. Identify the best feature for splitting based on the highest Information Gain.
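The entropy and Information Gain computations for Table 3 can be sketched in plain Python (the standard ID3-style calculation):

```python
from math import log2
from collections import Counter

# Rows of Table 3: (Age, Income, Student, Credit Score, Buy Product)
data = [
    ("Young",  "High",   "No",  "Fair",      "No"),
    ("Young",  "High",   "No",  "Excellent", "No"),
    ("Middle", "High",   "No",  "Fair",      "Yes"),
    ("Senior", "Medium", "No",  "Fair",      "Yes"),
    ("Senior", "Low",    "Yes", "Fair",      "Yes"),
    ("Senior", "Low",    "Yes", "Excellent", "No"),
    ("Middle", "Low",    "Yes", "Excellent", "Yes"),
    ("Young",  "Medium", "No",  "Fair",      "No"),
    ("Young",  "Low",    "Yes", "Fair",      "Yes"),
    ("Senior", "Medium", "Yes", "Fair",      "Yes"),
    ("Young",  "Medium", "Yes", "Excellent", "Yes"),
    ("Middle", "Medium", "No",  "Excellent", "Yes"),
    ("Middle", "High",   "Yes", "Fair",      "Yes"),
    ("Senior", "Medium", "No",  "Excellent", "No"),
]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

H = entropy([row[-1] for row in data])   # entropy of Buy Product (9 Yes, 5 No)

def info_gain(col):
    # IG(X) = H(Y) - H(Y|X), with H(Y|X) the weighted entropy of each subset.
    n = len(data)
    cond = 0.0
    for v in {row[col] for row in data}:
        sub = [row[-1] for row in data if row[col] == v]
        cond += len(sub) / n * entropy(sub)
    return H - cond

features = ["Age", "Income", "Student", "Credit Score"]
gains = {f: info_gain(i) for i, f in enumerate(features)}
```

The feature with the largest gain in `gains` is the best split for the root of the tree.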

16 A rule in rule-based reasoning is often written in the form: IF condition(s) THEN action.

Explain the components of such a rule and provide an example of how rules of this form are used in data mining for classification.

17 Consider the following dataset with 5 attributes and a binary target variable (Buy Product: Yes/No). Use Rule-Based Reasoning to extract decision rules.

ID Age Income Student Credit Score Buy Product


1 Young High No Fair No
2 Middle Low Yes Excellent Yes
3 Senior Medium No Fair Yes
4 Young Medium Yes Excellent Yes
5 Senior Low No Fair No

Table 4: Simple Table for Rule-Based Reasoning

1. Identify patterns in the dataset to derive classification rules.


2. Express these rules in the form:

IF condition(s) THEN decision

3. Extract at least two IF-THEN rules for predicting Buy Product.

18 Given the following rule-based classification system, classify the new records based on the provided rules.
• Rule 1: IF Age = Young AND Student = Yes THEN Buy Product = Yes
• Rule 2: IF Credit Score = Fair AND Income = High THEN Buy Product = No
• Rule 3: IF Age = Senior AND Credit Score = Fair THEN Buy Product = No
Classify whether the person will Buy Product (Yes/No) for the following new records:

ID Age Income Student Credit Score Buy Product (Yes/No)?


1 Young Medium Yes Fair ?
2 Senior Low No Excellent ?
3 Middle High Yes Fair ?

Table 5: New Records for Classification

1. Apply the given rules to classify each record.


2. Fill in the Buy Product column with “Yes” or “No” based on the matching rule.
3. If no rule directly applies, justify your classification decision.

19 Consider a transaction dataset from an online store. Using support and confidence, generate two association rules with a minimum support of 50% and minimum confidence of 70%.

Transaction ID Items Bought


1 Milk, Bread, Butter
2 Milk, Butter
3 Bread, Butter
4 Milk, Bread
5 Milk, Bread, Butter

Table 6: Transaction Data for Apriori Algorithm

1. Compute the support for individual items and frequent itemsets.


2. Identify frequent itemsets that satisfy the minimum support threshold of 50%.
3. Calculate the confidence for candidate association rules.
4. Generate two strong association rules that meet both support and confidence thresholds.
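The support and confidence calculations over Table 6 can be checked with a short Python sketch:

```python
from itertools import combinations

transactions = [
    {"Milk", "Bread", "Butter"},
    {"Milk", "Butter"},
    {"Bread", "Butter"},
    {"Milk", "Bread"},
    {"Milk", "Bread", "Butter"},
]

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Enumerate all candidate itemsets and keep those meeting min support 50%.
items = sorted(set().union(*transactions))
frequent = [set(c) for r in (1, 2, 3)
            for c in combinations(items, r) if support(set(c)) >= 0.5]

# Confidence of X -> Y is support(X union Y) / support(X),
# e.g. Milk -> Bread: 0.6 / 0.8 = 0.75, which clears the 70% threshold.
conf_milk_bread = support({"Milk", "Bread"}) / support({"Milk"})
```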

20 Given two vectors in a 5-dimensional space:

A = (1, 3, 7, 5, 9), B = (4, 7, 2, 8, 6)
Compute the Minkowski Distance for:
1. p = 1.5
2. p = 3
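The Minkowski distance generalizes the Manhattan (p = 1) and Euclidean (p = 2) distances; a direct implementation of its definition:

```python
def minkowski(a, b, p):
    # d(a, b) = (sum_i |a_i - b_i|^p)^(1/p)
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

A = (1, 3, 7, 5, 9)
B = (4, 7, 2, 8, 6)

d15 = minkowski(A, B, 1.5)
d3 = minkowski(A, B, 3)   # = (3^3 + 4^3 + 5^3 + 3^3 + 3^3)^(1/3) = 270^(1/3)
```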

21 The table below represents sales performance ($1000s) and
marketing expenses ($1000s) over six months:

Month Sales ($1000s) Marketing Spend ($1000s)


Jan 50 8
Feb 60 12
Mar 55 10
Apr 70 18
May 65 15
Jun 80 20

Table 7: Sales vs Marketing Spend Data

1. Compute the Pearson Correlation Coefficient r between sales and marketing spend.

2. Interpret whether marketing spend has a linear impact on sales.
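The coefficient can be computed directly from its definition using the six monthly pairs:

```python
sales = [50, 60, 55, 70, 65, 80]
spend = [8, 12, 10, 18, 15, 20]

n = len(sales)
mx, my = sum(sales) / n, sum(spend) / n
# r = sum((x - mx)(y - my)) / sqrt(sum((x - mx)^2) * sum((y - my)^2))
num = sum((x - mx) * (y - my) for x, y in zip(sales, spend))
den = (sum((x - mx) ** 2 for x in sales)
       * sum((y - my) ** 2 for y in spend)) ** 0.5
r = num / den
```

A value of r close to +1 would indicate a strong positive linear relationship between marketing spend and sales.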

22 Two documents contain the following sets of unique words:


D1 = {AI, Machine, Learning, Data, Science, Algorithm, Model}
D2 = {AI, Data, Science, Deep, Neural, Learning, Network}

1. Compute the Jaccard Similarity between these documents.


2. If a threshold of 0.5 is required for similarity, determine if these documents are similar.
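Jaccard similarity is the ratio of shared words to total distinct words, which Python set operations compute directly:

```python
D1 = {"AI", "Machine", "Learning", "Data", "Science", "Algorithm", "Model"}
D2 = {"AI", "Data", "Science", "Deep", "Neural", "Learning", "Network"}

# |intersection| / |union|: 4 shared words out of 10 distinct words overall.
jaccard = len(D1 & D2) / len(D1 | D2)
print(jaccard)  # 0.4
```

Since 0.4 falls below the 0.5 threshold, the documents would not be judged similar under that criterion.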

23 In a medical image segmentation task, two binary masks represent detected regions of tumors:
M1 = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]
M2 = [1, 0, 1, 1, 1, 1, 0, 1, 1, 0]

1. Compute the Dice Coefficient to measure the similarity of segmentation.

2. If a Dice score above 0.75 is considered a good segmentation, does the model perform well?
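The Dice coefficient, 2|A ∩ B| / (|A| + |B|), can be computed element-wise over the two masks:

```python
M1 = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]
M2 = [1, 0, 1, 1, 1, 1, 0, 1, 1, 0]

overlap = sum(a & b for a, b in zip(M1, M2))   # pixels set in both masks
dice = 2 * overlap / (sum(M1) + sum(M2))       # Dice = 2|A n B| / (|A| + |B|)
```

With 4 overlapping pixels and 7 foreground pixels in each mask, dice = 8/14 ≈ 0.571, below the 0.75 bar for good segmentation.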

24 Two probability distributions represent word frequencies in two different language models:
P = {0.15, 0.35, 0.25, 0.10, 0.15}, Q = {0.20, 0.30, 0.30, 0.10, 0.10}

1. Compute the Kullback-Leibler (KL) Divergence DKL (P ||Q).


2. Explain whether the two distributions are significantly different.
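The KL divergence can be computed term by term; log base 2 is assumed here, giving the result in bits:

```python
from math import log2

P = [0.15, 0.35, 0.25, 0.10, 0.15]
Q = [0.20, 0.30, 0.30, 0.10, 0.10]

# D_KL(P || Q) = sum_i p_i * log2(p_i / q_i)
d_kl = sum(p * log2(p / q) for p, q in zip(P, Q))
```

Note that D_KL is always non-negative and is not symmetric: D_KL(P || Q) generally differs from D_KL(Q || P).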

25 A new predictive model for tuberculosis detection was applied to 300 patients. The results are as follows:
• 120 patients actually had tuberculosis.
• Out of those 120, the model correctly identified 90 as positive.
• The model incorrectly classified 30 tuberculosis patients as negative (false negatives).
• Out of 180 healthy patients, the model correctly classified 150 as negative.

• The model incorrectly classified 30 healthy patients as positive (false positives).

1. Construct the Confusion Matrix.


2. Compute:
• Accuracy
• Precision
• Recall
• Specificity
• F1-score
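From the counts given, TP = 90, FN = 30, TN = 150, FP = 30, and the metrics follow directly from their definitions:

```python
TP, FN, TN, FP = 90, 30, 150, 30   # counts stated in the problem

accuracy    = (TP + TN) / (TP + TN + FP + FN)        # 240 / 300
precision   = TP / (TP + FP)                         # 90 / 120
recall      = TP / (TP + FN)                         # 90 / 120 (sensitivity)
specificity = TN / (TN + FP)                         # 150 / 180
f1          = 2 * precision * recall / (precision + recall)
```

Because precision and recall are equal here, the F1-score (their harmonic mean) equals both.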
