DM Theory Mid Term

An exam paper on data mining with solutions

Uploaded by

AY S



COMSATS University Islamabad, Wah Campus

Mid-Term Examination Fall 2024

Department of Computer Science

Program(s)/Classes: BCS 5 (D,E) Date: 01st-November-2024

Subject: Data Mining (DSC 306) Maximum Marks: 50 Marks
Instructor Name(s): Mr. Mian M. Talha Time Allowed: 1.5 Hrs

Important Guidelines:
1. Please attempt all questions on the separate answer sheets provided, following the correct
sequence. Questions answered out of order will not be graded.
2. The exchange of calculators is strictly prohibited. If you have any assumptions in mind while
attempting to answer the questions, please state them in advance.
3. Please refrain from attempting to cheat, as most of the questions are based on even and
odd registration numbers.
4. Carefully read the question scenarios before beginning your responses.

Scenario: Netflix Series Viewing Dataset

In the competitive landscape of streaming services, understanding viewer preferences and
behaviors is crucial for content development and marketing strategies. The Netflix dataset presents
an overview of various popular series, highlighting their genres, viewer demographics, ratings, and
overall performance. As a data analyst at Netflix tasked with enhancing the company's content
strategy, you are provided with a comprehensive dataset detailing various popular series available
on the platform. This dataset includes crucial information such as series names, genres, viewer
demographics, ratings, total views, watch time, countries, and current series statuses.
Consider the Netflix dataset provided on page 03. You are required to apply the data mining
processes and techniques studied throughout the course. Questions 1 and 2 are specifically
designed to assess your understanding and application of these techniques in relation to the Netflix
dataset.

Question # 01(a): (CLO-1, SO-1, Understanding) (05 Marks)

Data Mining Process: Classify each of the following steps as belonging to either the Data
Preparation or the Data Modeling stage of the data mining process.
Noise reduction, Classification, Empty data removal, Transformation, Clustering

Data Preparation        Data Modeling
Noise reduction         Classification
Empty data removal      Clustering
Transformation

Question # 01(b): (CLO-1, SO-1, Understanding) (08 Marks)

Data Attributes and Types: Refer to the Netflix dataset on page 03 and categorize the
attributes/features based on the following data types. Provide justifications where necessary.
Even Registration #     Odd Registration #
I. Nominal              I. Ordinal
II. Time Series         II. Multivariate
III. Discrete           III. Continuous

Even Registration
1. Nominal
o Series ID: Acts as a unique identifier without any numerical meaning.
o Series Name: Categorical, representing different names of series.
o Genre: Categorical, indicating different genres of series.
o Country: Categorical, representing various countries of origin.
o Series Status: Categorical, denoting the series completion status (e.g., "Ongoing,"
"Completed").
2. Time Series
o There is no attribute directly fitting the time series data type in this dataset. Time
series data typically involves measurements taken over specific time intervals, which
is not explicitly present here.
3. Discrete
o Viewer Age: In most cases, age is discrete when it represents whole numbers (like
years).
o Total Views: If considered as countable whole numbers (excluding formatting errors
like "1,000,000"), it can be viewed as discrete.

Odd Registration
1. Ordinal
o Rating: If ratings are whole numbers (e.g., 1, 2, 3) that signify a rank or order, they
are classified as ordinal. The order matters, but the exact differences between the
numbers are not meaningful. The scale is relative: a higher rating indicates a better
ranking or review, which suits an ordinal classification.
o Viewer Age (if we consider age groups or ranges instead of exact values): Could
represent a level of maturity or content suitability in an ordered way.
2. Multivariate
o The dataset itself is multivariate, involving multiple attributes that could be analyzed
simultaneously (e.g., Genre, Viewer Age, Rating, etc.).
3. Continuous
o Rating: If ratings include decimal values (e.g., 10.3, 2.3, 5.6), they are classified as
continuous. This classification indicates that the ratings can take any value within a
range, and the differences between values are meaningful.
o Watch Time (hrs): As time can be fractional and take on any value, it is continuous.
o Total Views: If formatted to remove commas and decimal points, this could be
continuous depending on the scale of measurement.

Question # 2 (a): (CLO-2, SO-2&4, Applying) (08 Marks)

Data Preprocessing: Refer to the Netflix dataset on page 03 and prepare the data by identifying
the following issues and anomalies present within it. Please list all identified anomalies.
Even Registration #           Odd Registration #
I. Outliers                   I. Noisy Data
II. Duplicate Data            II. Inconsistent Data
III. Incorrect Data Types     III. Logical Errors
Even Registration
1. Outliers
o Rating for "Queen's Gambit" (15.0): Ratings generally range from 0 to 10, making
15.0 an outlier.
o Watch Time (hrs) for "Queen's Gambit" (-5.0): Watch time should be positive; a
negative value is an outlier.
2. Duplicate Data
o Stranger Things (Series ID 001 and 015) appears twice with identical details in every
column, so the two rows are exact duplicates. Series ID 006 is a third "Stranger Things"
row with a different genre ("Fantasy" instead of "Sci-Fi") and different values, so it is
an inconsistent near-duplicate rather than an exact one.
o The Witcher (Series ID 007, 017, and 020) appears three times with the same rating, watch
time, country, and status. Rows 007 and 017 are exact duplicates; row 020 differs only in
genre ("Adventure" instead of "Fantasy").
3. Incorrect Data Types
o Viewer Age for "13 Reasons Why" ("Twenty-One"): Should be numerical, but it’s
written in text format.
o Rating for "13 Reasons Why" ("Eight"): Should be numeric but is entered as text.
o Total Views: Some entries include commas (e.g., "1,000,000"), causing them to be
treated as strings instead of integers.
o Watch Time (hrs) for "La Casa de Papel" (value is "?"): Should be a numeric value,
but contains a "?" placeholder.
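
The duplicate and outlier checks above can be sketched in plain Python. This is a minimal illustration rather than a model answer: `rows` copies only a handful of rows from the dataset table, the helper names `exact_duplicates` and `out_of_range` are hypothetical, and the 0 to 10 rating range is the same assumption the answer makes.

```python
from collections import defaultdict

# A few rows copied from the dataset table:
# (series_id, name, genre, viewer_age, rating, total_views, watch_time, country, status)
rows = [
    ("001", "Stranger Things", "Sci-Fi", "16", "9.2", "230", "8.5", "USA", "Ongoing"),
    ("003", "Queen's Gambit", "Drama", "23", "15.0", "1200", "-5.0", "UK", "Completed"),
    ("006", "Stranger Things", "Fantasy", "25", "9.1", "NaN", "7.0", "USA", "Ongoing"),
    ("007", "The Witcher", "Fantasy", "21", "8.7", "90000", "10.0", "Poland", "Ongoing"),
    ("015", "Stranger Things", "Sci-Fi", "16", "9.2", "230", "8.5", "USA", "Ongoing"),
    ("017", "The Witcher", "Fantasy", "21", "8.7", "90000", "10.0", "Poland", "Ongoing"),
    ("020", "The Witcher", "Adventure", "21", "8.7", "90000", "10.0", "Poland", "Ongoing"),
]


def exact_duplicates(rows):
    """Group series IDs whose fields, apart from the ID itself, are all identical."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[1:]].append(row[0])
    return [ids for ids in groups.values() if len(ids) > 1]


def out_of_range(rows, rating_range=(0.0, 10.0)):
    """Flag ratings outside the assumed range and negative watch times."""
    flagged = []
    for row in rows:
        rating, watch_time = float(row[4]), float(row[6])
        if not rating_range[0] <= rating <= rating_range[1]:
            flagged.append((row[0], "Rating", rating))
        if watch_time < 0:
            flagged.append((row[0], "Watch Time (hrs)", watch_time))
    return flagged


print(exact_duplicates(rows))
print(out_of_range(rows))
```

Grouping on every field except the ID separates true duplicates (001/015 and 007/017) from rows that merely share a series name, such as 006 and 020.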

Odd Registration
1. Noisy Data
o Viewer Age for "Money Heist," "Bridgerton," "Emily in Paris," and "Breaking Bad":
Contains missing values ("NaN" or "-"), adding noise to the dataset.
o Country for "Bridgerton": Missing value ("NaN"), contributing to noise.

2. Inconsistent Data
o Genre for "Breaking Bad" ("Crime/Drama"): Most series have a single genre listed,
while "Breaking Bad" has a combined genre format, leading to inconsistency.
o Viewer Age: Inconsistencies in data representation, with some ages provided
numerically (e.g., 16, 18) and others in text form (e.g., "Twenty-One" for "13 Reasons
Why").
o Total Views: Formatting inconsistency, as some entries include commas (e.g.,
"1,000,000") while others do not (e.g., "90000").

3. Logical Errors
o Rating of 15.0 for "Queen's Gambit": Exceeds the typical 0–10 range, which is
logically incorrect.
o Watch Time (hrs) of -5.0 for "Queen's Gambit": Negative watch time is logically
impossible.

o Viewer Age ("Twenty-One") and Rating ("Eight") for "13 Reasons Why": Text
representations of numerical values are logically inconsistent.
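
One way to surface the noisy and inconsistent entries listed above is to classify each cell of a numeric column. The sketch below is illustrative only: the function name `classify_numeric_cell` and the category labels are hypothetical, and treating "?" as a missing-value marker alongside "NaN" and "-" is an assumption.

```python
# Markers treated as missing values (including "?" here is an assumption).
MISSING_MARKERS = {"NaN", "-", "?", ""}


def classify_numeric_cell(text):
    """Return 'missing', 'numeric', 'formatted' (number with commas), or 'text'."""
    cell = text.strip()
    if cell in MISSING_MARKERS:
        return "missing"
    try:
        float(cell)
        return "numeric"
    except ValueError:
        pass
    try:
        float(cell.replace(",", ""))
        return "formatted"
    except ValueError:
        return "text"


# Sample cells taken from the Viewer Age, Rating, Total Views and Watch Time columns.
for cell in ["16", "NaN", "-", "Twenty-One", "Eight", "1,000,000", "90000", "?"]:
    print(repr(cell), "->", classify_numeric_cell(cell))
```

Cells labelled "formatted" or "text" correspond to the inconsistent-representation cases, while "missing" cells correspond to the noisy entries.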

Question # 2 (b): (CLO-2, SO-2&4, Applying) (06 Marks)

Data Transformation: Refer to the Netflix dataset on page 03 and apply data transformation
techniques to address and manage any issues related to impure data.
Even Registration # Odd Registration #
Normalize the attribute Rating into a new     Normalize the attribute Watch Time into a new
range of 1 to 5 stars, for Squid Game and     range of 10 to 50 average, for Narcos and
The Witcher only.                             The Crown only.

Even Registration
Normalization of Rating (for Squid Game and The Witcher):
Formula:

Odd Registration
Normalization of Watch Time (for Narcos and The Crown):
Formula:
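
Mapping an attribute into a new range is usually done with min-max normalization, so the sketch below assumes that is the technique intended for both parts. The function name `min_max_normalize` is hypothetical. The old minimum and maximum are also an assumption: they are taken over the valid dataset values only, ignoring the out-of-range rating 15.0, the text entry "Eight", the negative watch time -5.0, and the missing markers.

```python
def min_max_normalize(value, old_min, old_max, new_min, new_max):
    """v' = (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min"""
    if old_max == old_min:
        raise ValueError("old_min and old_max must differ")
    return (value - old_min) / (old_max - old_min) * (new_max - new_min) + new_min


# Even registration: Rating into 1..5.
# Assumed valid rating range in the dataset: 7.5 (Bridgerton) to 9.6 (Squid Game).
squid_game = min_max_normalize(9.6, 7.5, 9.6, 1, 5)    # 5.0
the_witcher = min_max_normalize(8.7, 7.5, 9.6, 1, 5)   # about 3.29

# Odd registration: Watch Time into 10..50.
# Assumed valid watch-time range in the dataset: 1.0 (Narcos) to 10.0.
narcos = min_max_normalize(1.0, 1.0, 10.0, 10, 50)     # 10.0
the_crown = min_max_normalize(4.5, 1.0, 10.0, 10, 50)  # about 25.56

print(squid_game, the_witcher, narcos, the_crown)
```

If the minimum and maximum are instead computed on the uncleaned column, the results change, so any such assumption should be stated in the answer.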

Question # 2 (c): (CLO-2, SO-2&4, Applying) (08 Marks)
Data Smoothing: Refer to the Netflix dataset on page 03. Perform data smoothing using
the following techniques:
Even Registration #                      Odd Registration #
• Ignore only the noisy or empty data    • Ignore only the noisy or empty data
• Make FOUR Bins                         • Make FOUR Bins
• Perform smoothing by bin means         • Perform smoothing by bin boundaries
  on the attribute Viewer Age              on the attribute Viewer Age

Data Transformation and Smoothing for Viewer Age

1. Convert Inconsistent Data:
o Transform non-numeric entries in the Viewer Age attribute to numerical values.
▪ "Twenty-One" ➔ 21
2. Ignore Noisy or Empty Data:
o Remove rows where Viewer Age is missing (NaN or "-").
o Ignored entries:
▪ Money Heist (NaN)
▪ Bridgerton (NaN)
▪ Emily in Paris ("-")
▪ Breaking Bad (NaN)
After transformation, the Viewer Age values are:
16, 23, 18, 21, 25, 21, 22, 24, 45, 34, 16, 20, 21, 22, 30, 21.
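
The two cleaning steps can be reproduced with a short Python sketch. The names `WORD_TO_NUMBER`, `MISSING`, and `clean_ages` are hypothetical, and the word lookup covers only the one text entry that occurs in the Viewer Age column.

```python
# Viewer Age column in dataset order (Series ID 001 to 020).
viewer_age_raw = [
    "16", "NaN", "23", "18", "Twenty-One", "25", "21", "NaN", "22", "24",
    "-", "45", "34", "NaN", "16", "20", "21", "22", "30", "21",
]

WORD_TO_NUMBER = {"Twenty-One": 21}  # only the word form present in the column
MISSING = {"NaN", "-"}               # markers treated as empty


def clean_ages(raw):
    """Drop missing entries and convert the remaining cells to integers."""
    cleaned = []
    for cell in raw:
        if cell in MISSING:
            continue
        cleaned.append(WORD_TO_NUMBER[cell] if cell in WORD_TO_NUMBER else int(cell))
    return cleaned


ages = clean_ages(viewer_age_raw)
print(len(ages), ages)
```

Four of the twenty rows are dropped, leaving the sixteen values used for binning.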

Even Registration
1. Make FOUR Bins:
o Sorted values: 16, 16, 18, 20, 21, 21, 21, 21, 22, 22, 23, 24, 25, 30, 34, 45.
o Divide into four equal bins:
▪ Bin 1: 16, 16, 18, 20
▪ Bin 2: 21, 21, 21, 21
▪ Bin 3: 22, 22, 23, 24
▪ Bin 4: 25, 30, 34, 45
2. Smoothing by Bin Means:
o Bin 1 Mean: (16 + 16 + 18 + 20) / 4 = 17.5
o Bin 2 Mean: (21 + 21 + 21 + 21) / 4 = 21
o Bin 3 Mean: (22 + 22 + 23 + 24) / 4 = 22.75
o Bin 4 Mean: (25 + 30 + 34 + 45) / 4 = 33.5
Smoothed Viewer Age values:
o Bin 1: 17.5, 17.5, 17.5, 17.5
o Bin 2: 21, 21, 21, 21
o Bin 3: 22.75, 22.75, 22.75, 22.75
o Bin 4: 33.5, 33.5, 33.5, 33.5
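
The bin-means calculation can be checked with a few lines of Python. This is a sketch under the same choice made above (equal-frequency bins of four values each); the function names `equal_frequency_bins` and `smooth_by_means` are hypothetical.

```python
def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of equal size."""
    ordered = sorted(values)
    size, remainder = divmod(len(ordered), n_bins)
    if remainder:
        raise ValueError("number of values must be divisible by n_bins")
    return [ordered[i * size:(i + 1) * size] for i in range(n_bins)]


def smooth_by_means(bins):
    """Replace every value in a bin with that bin's mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


ages = [16, 23, 18, 21, 25, 21, 22, 24, 45, 34, 16, 20, 21, 22, 30, 21]
bins = equal_frequency_bins(ages, 4)
smoothed = smooth_by_means(bins)
for original, result in zip(bins, smoothed):
    print(original, "->", result)
```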

Odd Registration
1. Make FOUR Bins:
o Sorted values (same as above): 16, 16, 18, 20, 21, 21, 21, 21, 22, 22, 23, 24, 25, 30,
34, 45.
o Bins:
▪ Bin 1: 16, 16, 18, 20
▪ Bin 2: 21, 21, 21, 21
▪ Bin 3: 22, 22, 23, 24
▪ Bin 4: 25, 30, 34, 45
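2. Smoothing by Bin Boundaries:
Each value is replaced by the closer of its bin's two boundaries (the bin minimum and
maximum). A value equidistant from both is assigned to the lower boundary.
o Bin 1 Boundary Values: 16, 20
▪ Smoothed values: 16, 16, 16, 20 (18 is equidistant from 16 and 20)
o Bin 2 Boundary Values: 21, 21
▪ Smoothed values: 21, 21, 21, 21
o Bin 3 Boundary Values: 22, 24
▪ Smoothed values: 22, 22, 22, 24 (23 is equidistant from 22 and 24)
o Bin 4 Boundary Values: 25, 45
▪ Smoothed values: 25, 25, 25, 45 (34 is 9 away from 25 and 11 away from 45)
Smoothed Viewer Age values:
o Bin 1: 16, 16, 16, 20
o Bin 2: 21, 21, 21, 21
o Bin 3: 22, 22, 22, 24
o Bin 4: 25, 25, 25, 45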
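
Smoothing by bin boundaries replaces each value with the nearer of its bin's minimum and maximum. The sketch below shows one way to do that; the function name `smooth_by_boundaries` is hypothetical, and sending ties to the lower boundary is an assumption, since either boundary is defensible for an equidistant value.

```python
def smooth_by_boundaries(bins):
    """Replace each value with the closer bin boundary (ties go to the lower one)."""
    smoothed = []
    for b in bins:
        low, high = min(b), max(b)
        smoothed.append([low if value - low <= high - value else high for value in b])
    return smoothed


# Equal-frequency bins of the sorted Viewer Age values.
bins = [[16, 16, 18, 20], [21, 21, 21, 21], [22, 22, 23, 24], [25, 30, 34, 45]]
boundary_smoothed = smooth_by_boundaries(bins)
for original, result in zip(bins, boundary_smoothed):
    print(original, "->", result)
```

In Bin 4, 34 lies 9 away from 25 and 11 away from 45, so it maps to 25 rather than staying unchanged.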

Question # 3 (a): (CLO-2, SO-2&4, Applying) (15 Marks)

Decision Tree Scenario:
You are a data analyst for a social media company aiming to predict whether a user will
spend more than 2 hours on a platform based on usage behavior. Your dataset has 14
records with attributes related to platform, activity level, content engagement, and
network strength, as well as a final decision on whether the user spent more than 2 hours
("Yes" or "No").
Using the ID3 algorithm with Entropy and Information Gain, you need to identify the
attribute that best fits as the root node of your decision tree.
Day  Platform   Activity Level  Content Engagement  Network Strength  Spent > 2 hours?
1    Instagram  High            Low                 Weak              No
2    Instagram  High            Low                 Strong            No
3    Twitter    High            Low                 Weak              Yes
4    Facebook   Moderate        Low                 Weak              Yes
5    Facebook   Low             Moderate            Weak              Yes
6    Facebook   Low             Moderate            Strong            No
7    Twitter    Low             Moderate            Strong            Yes
8    Instagram  Moderate        Low                 Weak              No
9    Instagram  Low             Moderate            Weak              Yes
10   Facebook   Moderate        Moderate            Weak              Yes
11   Instagram  Moderate        Moderate            Strong            Yes
12   Twitter    Moderate        Low                 Strong            Yes
13   Twitter    High            Moderate            Weak              Yes
14   Facebook   Moderate        Low                 Strong            No
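
The root-node selection can be checked with a short Python sketch of the entropy and information-gain calculation used by ID3. It is offered as a way to verify hand calculations, not as the official solution; the names `RECORDS`, `entropy`, and `information_gain` are hypothetical, and the records are copied from the table above.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ["Platform", "Activity Level", "Content Engagement", "Network Strength"]

# (Platform, Activity Level, Content Engagement, Network Strength, Spent > 2 hours?)
RECORDS = [
    ("Instagram", "High", "Low", "Weak", "No"),
    ("Instagram", "High", "Low", "Strong", "No"),
    ("Twitter", "High", "Low", "Weak", "Yes"),
    ("Facebook", "Moderate", "Low", "Weak", "Yes"),
    ("Facebook", "Low", "Moderate", "Weak", "Yes"),
    ("Facebook", "Low", "Moderate", "Strong", "No"),
    ("Twitter", "Low", "Moderate", "Strong", "Yes"),
    ("Instagram", "Moderate", "Low", "Weak", "No"),
    ("Instagram", "Low", "Moderate", "Weak", "Yes"),
    ("Facebook", "Moderate", "Moderate", "Weak", "Yes"),
    ("Instagram", "Moderate", "Moderate", "Strong", "Yes"),
    ("Twitter", "Moderate", "Low", "Strong", "Yes"),
    ("Twitter", "High", "Moderate", "Weak", "Yes"),
    ("Facebook", "Moderate", "Low", "Strong", "No"),
]


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(records, index):
    """Entropy of the whole set minus the weighted entropy after splitting on one attribute."""
    partitions = {}
    for record in records:
        partitions.setdefault(record[index], []).append(record[-1])
    remainder = sum(len(p) / len(records) * entropy(p) for p in partitions.values())
    return entropy([record[-1] for record in records]) - remainder


gains = {name: information_gain(RECORDS, i) for i, name in enumerate(ATTRIBUTES)}
root = max(gains, key=gains.get)
for name, gain in gains.items():
    print(f"{name}: {gain:.3f}")
print("Root node:", root)
```

With 9 "Yes" and 5 "No" records the overall entropy is about 0.940. Platform gives the largest gain (about 0.247), because every Twitter record is "Yes", so it is the attribute that fits as the root node.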

***GOOD LUCK DATA SCIENTISTS***
TRUE MARKS COME NOT FROM MEMORIZING ANSWERS BUT FROM MASTERING CONCEPTS.

Netflix Series Viewing Dataset (Q# 01 & Q# 02)

Series ID  Series Name          Genre        Viewer Age  Rating  Total Views  Watch Time (hrs)  Country   Series Status
001        Stranger Things      Sci-Fi       16          9.2     230          8.5               USA       Ongoing
002        Money Heist          Thriller     NaN         8.3     1,000,000    10.0              Spain     Completed
003        Queen's Gambit       Drama        23          15.0    1200         -5.0              UK        Completed
004        DARK                 Sci-Fi       18          9.0     1500         9.9               Germany   Ongoing
005        13 Reasons Why       Drama        Twenty-One  Eight   NaN          NaN               USA       Completed
006        Stranger Things      Fantasy      25          9.1     NaN          7.0               USA       Ongoing
007        The Witcher          Fantasy      21          8.7     90000        10.0              Poland    Ongoing
008        Bridgerton           Romance      NaN         7.5     350          8.5               NaN       Completed
009        Lucifer              Thriller     22          9.3     700          5.5               USA       Ongoing
010        Squid Game           Thriller     24          9.6     4500         8.8               Korea     Completed
011        Emily in Paris       Comedy       -           NaN     80           3.3               France    Ongoing
012        La Casa de Papel     Thriller     45          8.4     NaN          ?                 Spain     Completed
013        Narcos               Crime        34          8.8     5000         1.0               Colombia  Ongoing
014        Breaking Bad         Crime/Drama  NaN         9.5     750          8.0               Mexico    Completed
015        Stranger Things      Sci-Fi       16          9.2     230          8.5               USA       Ongoing
016        The Elite            Drama        20          8.2     1100         6.5               Spain     Completed
017        The Witcher          Fantasy      21          8.7     90000        10.0              Poland    Ongoing
018        The Vampire Diaries  Fantasy      22          8.7     1200         6.5               USA       Completed
019        The Crown            Historical   30          8.8     6000         4.5               UK        Completed
020        The Witcher          Adventure    21          8.7     90000        10.0              Poland    Ongoing
