DM Theory Mid Term

An exam paper on data mining, with solutions

COMSATS University Islamabad, Wah Campus

Mid-Term Examination Fall 2024


Department of Computer Science

Program(s)/Classes: BCS 5 (D,E) Date: 01st-November-2024


Subject: Data Mining (DSC 306) Maximum Marks: 50 Marks
Instructor Name(s): Mr. Mian M. Talha Time Allowed: 1.5 Hrs

Important Guidelines:
1. Please attempt all questions on the separate answer sheets provided, following the correct
sequence. Questions answered out of order will not be graded.
2. The exchange of calculators is strictly prohibited. If you have any assumptions in mind while
attempting to answer the questions, please state them in advance.
3. Please refrain from attempting to cheat, as most of the questions are based on even and
odd registration numbers.
4. Carefully read the question scenarios before beginning your responses.

Scenario: Netflix Series Viewing Dataset


In the competitive landscape of streaming services, understanding viewer preferences and
behaviors is crucial for content development and marketing strategies. The Netflix dataset presents
an overview of various popular series, highlighting their genres, viewer demographics, ratings, and
overall performance. As a data analyst at Netflix tasked with enhancing Netflix's content
strategy, you are provided with a comprehensive dataset detailing various popular series available
on the platform. This dataset includes crucial information such as series names, genres, viewer
demographics, ratings, total views, watch time, countries, and current series statuses.
Consider the Netflix dataset provided on page 03. You are required to apply the data mining
processes and techniques studied throughout the course. Questions 1 and 2 are specifically
designed to assess your understanding and application of these techniques in relation to the Netflix
dataset.

Question # 01(a): (CLO-1, SO-1, Understanding) (05 Marks)


Data Mining Process: Classify the following steps into the data mining process of Data Preparation
and Data Modeling.
Noise reduction, Classification, Empty data removal, Transformation, Clustering

Data Preparation        Data Modeling
Noise reduction         Classification
Empty data removal      Clustering
Transformation

Question # 01(b): (CLO-1, SO-1, Understanding) (08 Marks)


Data Attributes and Types: Refer to the Netflix dataset on page 03 and categorize the
attributes/features based on the following data types. Provide justifications where necessary.
Even Registration #     Odd Registration #
I. Nominal              I. Ordinal
II. Time Series         II. Multivariate
III. Discrete           III. Continuous

Even Registration
1. Nominal
o Series ID: Acts as a unique identifier without any numerical meaning.
o Series Name: Categorical, representing different names of series.
o Genre: Categorical, indicating different genres of series.
o Country: Categorical, representing various countries of origin.
o Series Status: Categorical, denoting the series completion status (e.g., "Ongoing,"
"Completed").
2. Time Series
o There is no attribute directly fitting the time series data type in this dataset. Time
series data typically involves measurements taken over specific time intervals, which
is not explicitly present here.
3. Discrete
o Viewer Age: In most cases, age is discrete when it represents whole numbers (like
years).
o Total Views: If considered as countable whole numbers (excluding formatting errors
like "1,000,000"), it can be viewed as discrete.

Odd Registration
1. Ordinal
o Rating: If ratings are whole numbers (e.g., 1, 2, 3) that signify a rank or order, they
are classified as ordinal. The order matters, but the exact differences between the
numbers are not meaningful. Reflects a relative scale, where a higher rating
indicates a better ranking or review, suitable for an ordinal classification.
o Viewer Age (if we consider age groups or ranges instead of exact values): Could
represent a level of maturity or content suitability in an ordered way.
2. Multivariate
o The dataset itself is multivariate, involving multiple attributes that could be analyzed
simultaneously (e.g., Genre, Viewer Age, Rating, etc.).
3. Continuous
o Rating: If ratings include decimal values (e.g., 10.3, 2.3, 5.6), they are classified as
continuous. This classification indicates that the ratings can take any value within a
range, and the differences between values are meaningful.
o Watch Time (hrs): As time can be fractional and take on any value, it is continuous.
o Total Views: As a count it is strictly discrete; once commas are removed, it could be
treated as continuous depending on the scale of measurement.

Question # 2 (a): (CLO-2, SO-2&4, Applying) (08 Marks)


Data Preprocessing: Refer to the Netflix dataset on page 03 and prepare the data by identifying
the following issues and anomalies present within it. Please list all identified anomalies.
Even Registration #           Odd Registration #
I. Outliers                   I. Noisy Data
II. Duplicate Data            II. Inconsistent Data
III. Incorrect Data Types     III. Logical Errors
Even Registration
1. Outliers
o Rating for "Queen's Gambit" (15.0): Ratings generally range from 0 to 10, making
15.0 an outlier.
o Watch Time (hrs) for "Queen's Gambit" (-5.0): Watch time should be positive; a
negative value is an outlier.
2. Duplicate Data
o Stranger Things (Series ID 001 and 015) appears twice with identical details. Series ID
006 carries the same name but a different genre ("Fantasy" instead of "Sci-Fi") and
different values, indicating a possible near-duplicate entry.
o The Witcher (Series ID 007, 017, and 020) appears three times with the same details
(viewer age, rating, total views, watch time, country, and status). Entries 007 and 017
match exactly, while 020 differs only in genre ("Adventure" instead of "Fantasy").
3. Incorrect Data Types
o Viewer Age for "13 Reasons Why" ("Twenty-One"): Should be numerical, but it’s
written in text format.
o Rating for "13 Reasons Why" ("Eight"): Should be numeric but is entered as text.
o Total Views: Some entries include commas (e.g., "1,000,000"), causing them to be
treated as strings instead of integers.
o Watch Time (hrs) for "La Casa de Papel" (value is "?"): Should be a numeric value,
but contains a "?" placeholder.
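
The duplicate and outlier checks above can be sketched in plain Python. This is only an illustration: the row subset, the function names, and the assumed valid rating range of 0 to 10 are choices made for this example, not part of the exam paper.

```python
# Hypothetical subset of the dataset rows:
# (series_id, name, genre, rating, watch_time_hrs)
ROWS = [
    ("001", "Stranger Things", "Sci-Fi", 9.2, 8.5),
    ("003", "Queen's Gambit", "Drama", 15.0, -5.0),
    ("007", "The Witcher", "Fantasy", 8.7, 10.0),
    ("015", "Stranger Things", "Sci-Fi", 9.2, 8.5),
    ("017", "The Witcher", "Fantasy", 8.7, 10.0),
    ("020", "The Witcher", "Adventure", 8.7, 10.0),
]


def find_exact_duplicates(rows):
    """Return (first_id, later_id) pairs whose fields match apart from Series ID."""
    first_seen = {}
    duplicates = []
    for series_id, *fields in rows:
        key = tuple(fields)
        if key in first_seen:
            duplicates.append((first_seen[key], series_id))
        else:
            first_seen[key] = series_id
    return duplicates


def find_range_violations(rows, rating_range=(0.0, 10.0)):
    """Flag ratings outside the assumed range and negative watch times."""
    low, high = rating_range
    violations = []
    for series_id, _name, _genre, rating, watch_time in rows:
        if not low <= rating <= high:
            violations.append((series_id, "Rating", rating))
        if watch_time < 0:
            violations.append((series_id, "Watch Time (hrs)", watch_time))
    return violations


print(find_exact_duplicates(ROWS))
print(find_range_violations(ROWS))
```

An exact-match check does not catch near-duplicates such as Series ID 020, which differs from 007 only in genre; those need a comparison on a chosen subset of columns.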

Odd Registration
1. Noisy Data
o Viewer Age for "Money Heist," "Bridgerton," "Emily in Paris," and "Breaking Bad":
Contains missing values ("NaN" or "-"), adding noise to the dataset.
o Country for "Bridgerton": Missing value ("NaN"), contributing to noise.

2. Inconsistent Data
o Genre for "Breaking Bad" ("Crime/Drama"): Most series have a single genre listed,
while "Breaking Bad" has a combined genre format, leading to inconsistency.
o Viewer Age: Inconsistencies in data representation, with some ages provided
numerically (e.g., 16, 18) and others in text form (e.g., "Twenty-One" for "13 Reasons
Why").
o Total Views: Formatting inconsistency, as some entries include commas (e.g.,
"1,000,000") while others do not (e.g., "90000").

3. Logical Errors
o Rating of 15.0 for "Queen's Gambit": Exceeds the typical 0–10 range, which is
logically incorrect.
o Watch Time (hrs) of -5.0 for "Queen's Gambit": Negative watch time is logically
impossible.

o Viewer Age ("Twenty-One") and Rating ("Eight") for "13 Reasons Why": Text
representations for numerical values are logically inconsistent.
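
One way to handle the inconsistent representations listed above is to coerce every raw cell to a number or to a missing marker. The sketch below is an assumption-laden illustration: the function name `to_number`, the set of missing-value tokens, and the tiny word lookup are hypothetical and cover only the values that appear in this dataset.

```python
# Tokens treated as missing (assumption based on the dataset: NaN, "-", "?").
MISSING_TOKENS = {"", "nan", "-", "?"}

# Hypothetical lookup for the two spelled-out numbers found in the dataset.
WORD_NUMBERS = {"twenty-one": 21.0, "eight": 8.0}


def to_number(raw):
    """Convert a raw cell to float, or return None when it is missing or unparseable."""
    text = str(raw).strip().lower()
    if text in MISSING_TOKENS:
        return None
    if text in WORD_NUMBERS:
        return WORD_NUMBERS[text]
    try:
        # Strip thousands separators such as "1,000,000" before parsing.
        return float(text.replace(",", ""))
    except ValueError:
        return None


for cell in ["1,000,000", "90000", "Twenty-One", "Eight", "NaN", "-", "?"]:
    print(repr(cell), "->", to_number(cell))
```

Coercion only fixes the representation. A value such as -5.0 still parses successfully and has to be caught by a separate range check.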

Question # 2 (b): (CLO-2, SO-2&4, Applying) (06 Marks)


Data Transformation: Refer to the Netflix dataset on page 03 and apply data transformation
techniques to address and manage any issues related to impure data.
Even Registration # Odd Registration #
Normalize the attribute Rating into      Normalize the attribute Watch Time into
a new range of 1 to 5 stars, for         a new range of 10 to 50 average, for
Squid Game and The Witcher only.         Narcos and The Crown only.

Even Registration
Normalization of Rating (for Squid Game and The Witcher):
Formula:

Odd Registration
Normalization of Watch Time (for Narcos and The Crown):
Formula:
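
The formulas themselves are not reproduced in this copy. Rescaling into a new range is commonly done with min-max normalization, v' = (v - min) / (max - min) * (new_max - new_min) + new_min, so the sketch below assumes that technique. The original ranges are also assumptions: 0 to 10 for Rating (the range cited in the outlier discussion) and 1.0 to 10.0 for Watch Time (the smallest and largest valid values in the dataset). The answer key may have used different bounds.

```python
def min_max_scale(value, old_min, old_max, new_min, new_max):
    """Min-max normalization of value from [old_min, old_max] into [new_min, new_max]."""
    if old_max == old_min:
        raise ValueError("old_min and old_max must differ")
    return (value - old_min) / (old_max - old_min) * (new_max - new_min) + new_min


# Even registration: Rating into 1..5 stars, assuming ratings span 0..10.
squid_game_stars = min_max_scale(9.6, 0.0, 10.0, 1.0, 5.0)
witcher_stars = min_max_scale(8.7, 0.0, 10.0, 1.0, 5.0)

# Odd registration: Watch Time into 10..50, assuming valid values span 1.0..10.0.
narcos_scaled = min_max_scale(1.0, 1.0, 10.0, 10.0, 50.0)
crown_scaled = min_max_scale(4.5, 1.0, 10.0, 10.0, 50.0)

print(round(squid_game_stars, 2), round(witcher_stars, 2))
print(round(narcos_scaled, 2), round(crown_scaled, 2))
```

Under these assumed bounds, Squid Game maps to about 4.84 stars and The Witcher to about 4.48, while Narcos maps to 10 and The Crown to about 25.56.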

Question # 2 (c): (CLO-2, SO-2&4, Applying) (08 Marks)
Data Smoothing: Refer to the Netflix dataset on page 03. Perform data smoothing using
the following techniques:
Even Registration #                      Odd Registration #
• Ignore only the noisy or empty data    • Ignore only the noisy or empty data
• Make FOUR bins                         • Make FOUR bins
• Perform smoothing by bin means         • Perform smoothing by bin boundaries
  on the attribute Viewer Age              on the attribute Viewer Age

Data Transformation and Smoothing for Viewer Age


1. Convert Inconsistent Data:
o Transform non-numeric entries in the Viewer Age attribute to numerical values.
▪ "Twenty-One" ➔ 21
2. Ignore Noisy or Empty Data:
o Remove rows where Viewer Age is missing (NaN or "-").
o Ignored entries:
▪ Money Heist (NaN)
▪ Bridgerton (NaN)
▪ Emily in Paris ("-")
▪ Breaking Bad (NaN)
After transformation, the Viewer Age values are:
16, 23, 18, 21, 25, 21, 22, 24, 45, 34, 16, 20, 21, 22, 30, 21.

Even Registration
1. Make FOUR Bins:
o Sorted values: 16, 16, 18, 20, 21, 21, 21, 21, 22, 22, 23, 24, 25, 30, 34, 45.
o Divide into four equal bins:
▪ Bin 1: 16, 16, 18, 20
▪ Bin 2: 21, 21, 21, 21
▪ Bin 3: 22, 22, 23, 24
▪ Bin 4: 25, 30, 34, 45
2. Smoothing by Bin Means:
o Bin 1 Mean: (16 + 16 + 18 + 20) / 4 = 17.5
o Bin 2 Mean: (21 + 21 + 21 + 21) / 4 = 21
o Bin 3 Mean: (22 + 22 + 23 + 24) / 4 = 22.75
o Bin 4 Mean: (25 + 30 + 34 + 45) / 4 = 33.5
Smoothed Viewer Age values:
o Bin 1: 17.5, 17.5, 17.5, 17.5
o Bin 2: 21, 21, 21, 21
o Bin 3: 22.75, 22.75, 22.75, 22.75
o Bin 4: 33.5, 33.5, 33.5, 33.5
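
The equal-frequency binning and bin-mean smoothing above can be reproduced with a short Python sketch. The function names are hypothetical, and the code assumes the number of values divides evenly into the requested number of bins, as it does here (16 values, 4 bins).

```python
def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of equal size."""
    ordered = sorted(values)
    size, remainder = divmod(len(ordered), n_bins)
    if remainder:
        raise ValueError("number of values must be divisible by n_bins")
    return [ordered[i * size:(i + 1) * size] for i in range(n_bins)]


def smooth_by_means(bins):
    """Replace every value in a bin with that bin's mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


# Viewer Age values after conversion and removal of missing entries.
ages = [16, 23, 18, 21, 25, 21, 22, 24, 45, 34, 16, 20, 21, 22, 30, 21]

bins = equal_frequency_bins(ages, 4)
smoothed = smooth_by_means(bins)
for number, (original, result) in enumerate(zip(bins, smoothed), start=1):
    print(f"Bin {number}: {original} -> {result}")
```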

Odd Registration
1. Make FOUR Bins:
o Sorted values (same as above): 16, 16, 18, 20, 21, 21, 21, 21, 22, 22, 23, 24, 25, 30,
34, 45.
o Bins:
▪ Bin 1: 16, 16, 18, 20
▪ Bin 2: 21, 21, 21, 21
▪ Bin 3: 22, 22, 23, 24
▪ Bin 4: 25, 30, 34, 45
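2. Smoothing by Bin Boundaries:
Each value is replaced by the closer of its bin's two boundary values. A value equidistant
from both boundaries (18 in Bin 1, 23 in Bin 3) is assigned to the lower boundary.
o Bin 1 Boundary Values: 16, 20
▪ Smoothed values: 16, 16, 16, 20
o Bin 2 Boundary Values: 21, 21
▪ Smoothed values: 21, 21, 21, 21
o Bin 3 Boundary Values: 22, 24
▪ Smoothed values: 22, 22, 22, 24
o Bin 4 Boundary Values: 25, 45
▪ Smoothed values: 25, 25, 25, 45 (34 is 9 away from 25 and 11 away from 45, so it
moves to 25)
Smoothed Viewer Age values:
o Bin 1: 16, 16, 16, 20
o Bin 2: 21, 21, 21, 21
o Bin 3: 22, 22, 22, 24
o Bin 4: 25, 25, 25, 45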
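
Smoothing by bin boundaries replaces each value with whichever of the bin's minimum or maximum is closer. The sketch below applies that rule to the four bins. The function name is hypothetical, and sending ties to the lower boundary is an assumption, since the technique itself does not fix a tie-breaking rule. Under the nearest-boundary rule, 34 in Bin 4 maps to 25 because it is 9 away from 25 and 11 away from 45.

```python
def smooth_by_boundaries(bin_values):
    """Replace each value with the nearer bin boundary (ties go to the lower one)."""
    low, high = min(bin_values), max(bin_values)
    return [low if value - low <= high - value else high for value in bin_values]


# Equal-frequency bins of the sorted Viewer Age values.
age_bins = [
    [16, 16, 18, 20],
    [21, 21, 21, 21],
    [22, 22, 23, 24],
    [25, 30, 34, 45],
]

boundary_smoothed = [smooth_by_boundaries(b) for b in age_bins]
for number, result in enumerate(boundary_smoothed, start=1):
    print(f"Bin {number}: {result}")
```

Two values are exact ties here: 18 in Bin 1 and 23 in Bin 3. With the opposite tie-breaking choice they would become 20 and 24 respectively.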

Question # 3 (a): (CLO-2, SO-2&4, Applying) (15 Marks)


Decision Tree Scenario:
You are a data analyst for a social media company aiming to predict whether a user will
spend more than 2 hours on a platform based on usage behavior. Your dataset has 14
records with attributes related to platform, activity level, content engagement, and
network strength, as well as a final decision on whether the user spent more than 2 hours
("Yes" or "No").
Using the ID3 algorithm with Entropy and Information Gain, you need to identify the
attribute that best fits as the root node of your decision tree.
Day  Platform   Activity Level  Content Engagement  Network Strength  Spent > 2 hours?
1    Instagram  High            Low                 Weak              No
2    Instagram  High            Low                 Strong            No
3    Twitter    High            Low                 Weak              Yes
4    Facebook   Moderate        Low                 Weak              Yes
5    Facebook   Low             Moderate            Weak              Yes
6    Facebook   Low             Moderate            Strong            No
7    Twitter    Low             Moderate            Strong            Yes
8    Instagram  Moderate        Low                 Weak              No
9    Instagram  Low             Moderate            Weak              Yes
10   Facebook   Moderate        Moderate            Weak              Yes
11   Instagram  Moderate        Moderate            Strong            Yes
12   Twitter    Moderate        Low                 Strong            Yes
13   Twitter    High            Moderate            Weak              Yes
14   Facebook   Moderate        Low                 Strong            No
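
The paper gives no worked solution for this question. As a hedged sketch of the ID3 root-selection step, the code below computes the entropy of the target column and the information gain of each attribute for the 14 records above. The function and variable names are chosen for this example only.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ("Platform", "Activity Level", "Content Engagement", "Network Strength")

# Each record: the four attribute values followed by the class label.
RECORDS = [
    ("Instagram", "High", "Low", "Weak", "No"),
    ("Instagram", "High", "Low", "Strong", "No"),
    ("Twitter", "High", "Low", "Weak", "Yes"),
    ("Facebook", "Moderate", "Low", "Weak", "Yes"),
    ("Facebook", "Low", "Moderate", "Weak", "Yes"),
    ("Facebook", "Low", "Moderate", "Strong", "No"),
    ("Twitter", "Low", "Moderate", "Strong", "Yes"),
    ("Instagram", "Moderate", "Low", "Weak", "No"),
    ("Instagram", "Low", "Moderate", "Weak", "Yes"),
    ("Facebook", "Moderate", "Moderate", "Weak", "Yes"),
    ("Instagram", "Moderate", "Moderate", "Strong", "Yes"),
    ("Twitter", "Moderate", "Low", "Strong", "Yes"),
    ("Twitter", "High", "Moderate", "Weak", "Yes"),
    ("Facebook", "Moderate", "Low", "Strong", "No"),
]


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(records, column):
    """Entropy of the whole set minus the weighted entropy after splitting on column."""
    labels = [record[-1] for record in records]
    groups = {}
    for record in records:
        groups.setdefault(record[column], []).append(record[-1])
    weighted = sum(len(g) / len(records) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted


total_entropy = entropy([record[-1] for record in RECORDS])
gains = {name: information_gain(RECORDS, i) for i, name in enumerate(ATTRIBUTES)}
root = max(gains, key=gains.get)

print(f"Entropy(S) = {total_entropy:.3f}")
for name, gain in gains.items():
    print(f"Gain({name}) = {gain:.3f}")
print("Root attribute:", root)
```

With 9 "Yes" and 5 "No" records the set entropy is about 0.940, and Platform has the largest gain (about 0.247), so it is the root node under ID3.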

***GOOD LUCK DATA SCIENTISTS***
TRUE MARKS COME NOT FROM MEMORIZING ANSWERS BUT FROM MASTERING CONCEPTS.

Netflix Series Viewing Dataset (Q# 01 & Q# 02)

Series ID  Series Name          Genre        Viewer Age  Rating  Total Views  Watch Time (hrs)  Country   Series Status
001        Stranger Things      Sci-Fi       16          9.2     230          8.5               USA       Ongoing
002        Money Heist          Thriller     NaN         8.3     1,000,000    10.0              Spain     Completed
003        Queen's Gambit       Drama        23          15.0    1200         -5.0              UK        Completed
004        DARK                 Sci-Fi       18          9.0     1500         9.9               Germany   Ongoing
005        13 Reasons Why       Drama        Twenty-One  Eight   NaN          NaN               USA       Completed
006        Stranger Things      Fantasy      25          9.1     NaN          7.0               USA       Ongoing
007        The Witcher          Fantasy      21          8.7     90000        10.0              Poland    Ongoing
008        Bridgerton           Romance      NaN         7.5     350          8.5               NaN       Completed
009        Lucifer              Thriller     22          9.3     700          5.5               USA       Ongoing
010        Squid Game           Thriller     24          9.6     4500         8.8               Korea     Completed
011        Emily in Paris       Comedy       -           NaN     80           3.3               France    Ongoing
012        La Casa de Papel     Thriller     45          8.4     NaN          ?                 Spain     Completed
013        Narcos               Crime        34          8.8     5000         1.0               Colombia  Ongoing
014        Breaking Bad         Crime/Drama  NaN         9.5     750          8.0               Mexico    Completed
015        Stranger Things      Sci-Fi       16          9.2     230          8.5               USA       Ongoing
016        The Elite            Drama        20          8.2     1100         6.5               Spain     Completed
017        The Witcher          Fantasy      21          8.7     90000        10.0              Poland    Ongoing
018        The Vampire Diaries  Fantasy      22          8.7     1200         6.5               USA       Completed
019        The Crown            Historical   30          8.8     6000         4.5               UK        Completed
020        The Witcher          Adventure    21          8.7     90000        10.0              Poland    Ongoing
