DM Theory Mid Term
Important Guidelines:
1. Please attempt all questions on the separate answer sheets provided, following the correct
sequence. Questions answered out of order will not be graded.
2. The exchange of calculators is strictly prohibited. If you have any assumptions in mind while
attempting to answer the questions, please state them in advance.
3. Please refrain from attempting to cheat, as most of the questions are based on even and
odd registration numbers.
4. Carefully read the question scenarios before beginning your responses.
Even Registration
1. Nominal
o Series ID: Acts as a unique identifier without any numerical meaning.
o Series Name: Categorical, representing different names of series.
o Genre: Categorical, indicating different genres of series.
o Country: Categorical, representing various countries of origin.
o Series Status: Categorical, denoting the series completion status (e.g., "Ongoing,"
"Completed").
2. Time Series
o There is no attribute directly fitting the time series data type in this dataset. Time
series data typically involves measurements taken over specific time intervals, which
is not explicitly present here.
3. Discrete
o Viewer Age: In most cases, age is discrete when it represents whole numbers (like
years).
o Total Views: If considered as countable whole numbers (excluding formatting errors
like "1,000,000"), it can be viewed as discrete.
Odd Registration
1. Ordinal
o Rating: If ratings are whole numbers (e.g., 1, 2, 3) that signify a rank or order, they
are classified as ordinal. The order matters, but the exact differences between the
numbers are not meaningful. The rating reflects a relative scale in which a higher value
indicates a better ranking or review, which suits an ordinal classification.
o Viewer Age (if we consider age groups or ranges instead of exact values): Could
represent a level of maturity or content suitability in an ordered way.
2. Multivariate
o The dataset itself is multivariate, involving multiple attributes that could be analyzed
simultaneously (e.g., Genre, Viewer Age, Rating, etc.).
3. Continuous
o Rating: If ratings include decimal values (e.g., 10.3, 2.3, 5.6), they are classified as
continuous. This classification indicates that the ratings can take any value within a
range, and the differences between values are meaningful.
o Watch Time (hrs): As time can be fractional and take on any value, it is continuous.
o Total Views: If formatted to remove commas and decimal points, this could be
continuous depending on the scale of measurement.
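The whole-number versus decimal rule applied to Rating above can be sketched in Python. The function name `classify_numeric` is hypothetical, and the rule is only the heuristic used in this answer, not a general test of measurement scale. The decimal ratings are taken from the dataset table; the whole-number list is the illustrative "1, 2, 3" case.

```python
def classify_numeric(values):
    """Label a numeric attribute using the heuristic from this answer:
    all whole numbers -> 'ordinal/discrete', any fractional value -> 'continuous'."""
    if all(float(v).is_integer() for v in values):
        return "ordinal/discrete"
    return "continuous"

# Ratings as they appear in the dataset (valid numeric entries only).
ratings = [9.2, 8.3, 9.0, 7.5, 9.3, 9.6]
# Hypothetical whole-number ratings, as in the "1, 2, 3" example.
star_ratings = [1, 2, 3]

print(classify_numeric(ratings))       # continuous
print(classify_numeric(star_ratings))  # ordinal/discrete
```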
Odd Registration
1. Noisy Data
o Viewer Age for "Money Heist," "Bridgerton," "Emily in Paris," and "Breaking Bad":
Contains missing values ("NaN" or "-"), adding noise to the dataset.
o Country for "Bridgerton": Missing value ("NaN"), contributing to noise.
2. Inconsistent Data
o Genre for "Breaking Bad" ("Crime/Drama"): Most series have a single genre listed,
while "Breaking Bad" has a combined genre format, leading to inconsistency.
o Viewer Age: Inconsistencies in data representation, with some ages provided
numerically (e.g., 16, 18) and others in text form (e.g., "Twenty-One" for "13 Reasons
Why").
o Total Views: Formatting inconsistency, as some entries include commas (e.g.,
"1,000,000") while others do not (e.g., "90000").
3. Logical Errors
o Rating of 15.0 for "Queen's Gambit": Exceeds the typical 0–10 range, which is
logically incorrect.
o Watch Time (hrs) of -5.0 for "Queen's Gambit": Negative watch time is logically
impossible.
o Viewer Age ("Twenty-One" and "Eight") and Rating ("Eight" for "13 Reasons Why"):
Text representations for numerical values are logically inconsistent.
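The three categories above (missing values, inconsistent formats, logical errors) could be checked programmatically along the following lines. This is a minimal sketch in plain Python: the helper names `parse_number` and `audit_row`, the set of missing-value markers, and the 0 to 10 rating range are assumptions made for illustration. The two sample rows are copied from the dataset table.

```python
MISSING_MARKERS = {"NaN", "-", "?", ""}  # assumed markers for "no value"

def parse_number(text):
    """Return a float for strings such as '1,000,000' or '90000', else None."""
    if text in MISSING_MARKERS:
        return None
    try:
        return float(text.replace(",", ""))
    except ValueError:
        return None  # e.g. "Twenty-One"

def audit_row(row):
    """Return a list of (column, issue) pairs for one record."""
    issues = []
    # Noisy data: any column holding a missing-value marker.
    for column, value in row.items():
        if value in MISSING_MARKERS:
            issues.append((column, "missing"))
    # Inconsistent data: combined genre, numbers written as words.
    if "/" in row["Genre"]:
        issues.append(("Genre", "combined genre format"))
    age = row["Viewer Age"]
    if age not in MISSING_MARKERS and parse_number(age) is None:
        issues.append(("Viewer Age", "text instead of number"))
    # Logical errors: rating outside the assumed 0-10 range, negative time.
    rating = parse_number(row["Rating"])
    if rating is not None and not 0 <= rating <= 10:
        issues.append(("Rating", "outside 0-10"))
    watch = parse_number(row["Watch Time (hrs)"])
    if watch is not None and watch < 0:
        issues.append(("Watch Time (hrs)", "negative"))
    return issues

queens_gambit = {"Series Name": "Queen's Gambit", "Genre": "Drama",
                 "Viewer Age": "23", "Rating": "15.0", "Total Views": "1200",
                 "Watch Time (hrs)": "-5.0", "Country": "UK"}
breaking_bad = {"Series Name": "Breaking Bad", "Genre": "Crime/Drama",
                "Viewer Age": "NaN", "Rating": "9.5", "Total Views": "750",
                "Watch Time (hrs)": "8.0", "Country": "Mexico"}

print(audit_row(queens_gambit))
# [('Rating', 'outside 0-10'), ('Watch Time (hrs)', 'negative')]
print(audit_row(breaking_bad))
# [('Viewer Age', 'missing'), ('Genre', 'combined genre format')]
```

The same `parse_number` helper also removes the comma inconsistency in Total Views, since "1,000,000" and "90000" both parse to plain numbers.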
Even Registration
Normalization of Rating (for Squid Game and The Witcher):
Formula:
Odd Registration
Normalization of Watch Time (for Narcos and The Crown):
Formula:
Question # 2 (c): (CLO-2, SO-2&4, Applying) (08 Marks)
Data Smoothing: Refer to the Netflix dataset on page 03. Perform data smoothing using
the following techniques:
Even Registration #
• Ignore only the noisy or empty data
• Make FOUR Bins
• Perform smoothing by bin means on the attribute Viewer Age
Odd Registration #
• Ignore only the noisy or empty data
• Make FOUR Bins
• Perform smoothing by bin boundaries on the attribute Viewer Age
Even Registration
1. Make FOUR Bins:
o Sorted values: 16, 16, 18, 20, 21, 21, 21, 21, 22, 22, 23, 24, 25, 30, 34, 45.
o Divide into four equal bins:
▪ Bin 1: 16, 16, 18, 20
▪ Bin 2: 21, 21, 21, 21
▪ Bin 3: 22, 22, 23, 24
▪ Bin 4: 25, 30, 34, 45
2. Smoothing by Bin Means:
o Bin 1 Mean: (16 + 16 + 18 + 20) / 4 = 17.5
o Bin 2 Mean: (21 + 21 + 21 + 21) / 4 = 21
o Bin 3 Mean: (22 + 22 + 23 + 24) / 4 = 22.75
o Bin 4 Mean: (25 + 30 + 34 + 45) / 4 = 33.5
Smoothed Viewer Age values:
o Bin 1: 17.5, 17.5, 17.5, 17.5
o Bin 2: 21, 21, 21, 21
o Bin 3: 22.75, 22.75, 22.75, 22.75
o Bin 4: 33.5, 33.5, 33.5, 33.5
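The binning and bin-mean steps above can be reproduced with a short Python sketch. The function names `equal_frequency_bins` and `smooth_by_means` are hypothetical, and the sketch assumes the cleaned values divide evenly into the requested number of bins, as they do here (16 values, 4 bins).

```python
def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of equal size."""
    ordered = sorted(values)
    size, remainder = divmod(len(ordered), n_bins)
    if remainder:
        raise ValueError("values do not divide evenly into bins")
    return [ordered[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin with that bin's mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

# Viewer Age values after ignoring noisy or empty entries (from the answer above).
ages = [16, 16, 18, 20, 21, 21, 21, 21, 22, 22, 23, 24, 25, 30, 34, 45]

bins = equal_frequency_bins(ages, 4)
for smoothed in smooth_by_means(bins):
    print(smoothed)
# [17.5, 17.5, 17.5, 17.5]
# [21.0, 21.0, 21.0, 21.0]
# [22.75, 22.75, 22.75, 22.75]
# [33.5, 33.5, 33.5, 33.5]
```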
Odd Registration
1. Make FOUR Bins:
o Sorted values (same as above): 16, 16, 18, 20, 21, 21, 21, 21, 22, 22, 23, 24, 25, 30,
34, 45.
o Bins:
▪ Bin 1: 16, 16, 18, 20
▪ Bin 2: 21, 21, 21, 21
▪ Bin 3: 22, 22, 23, 24
▪ Bin 4: 25, 30, 34, 45
2. Smoothing by Bin Boundaries:
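o Each value is replaced by the closer of its bin's two boundary values (the bin minimum
and maximum). A value equidistant from both boundaries is assigned to the lower one.
o Bin 1 Boundary Values: 16, 20
▪ Smoothed values: 16, 16, 16, 20 (18 is equidistant from 16 and 20)
o Bin 2 Boundary Values: 21, 21
▪ Smoothed values: 21, 21, 21, 21
o Bin 3 Boundary Values: 22, 24
▪ Smoothed values: 22, 22, 22, 24 (23 is equidistant from 22 and 24)
o Bin 4 Boundary Values: 25, 45
▪ Smoothed values: 25, 25, 25, 45 (34 is 9 away from 25 and 11 away from 45)
Smoothed Viewer Age values:
o Bin 1: 16, 16, 16, 20
o Bin 2: 21, 21, 21, 21
o Bin 3: 22, 22, 22, 24
o Bin 4: 25, 25, 25, 45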
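Smoothing by bin boundaries can be sketched in Python as follows. The function name `smooth_by_boundaries` is hypothetical. Every value is replaced by whichever bin boundary (minimum or maximum) is nearer; sending a tie to the lower boundary is an assumption of this sketch, since the technique itself does not fix a tie-breaking rule.

```python
def smooth_by_boundaries(bins):
    """Replace every value with the nearer of its bin's minimum and maximum.
    Ties go to the lower boundary (an assumption of this sketch)."""
    result = []
    for b in bins:
        low, high = min(b), max(b)
        result.append([low if v - low <= high - v else high for v in b])
    return result

# The four equal-frequency bins of Viewer Age from step 1.
bins = [
    [16, 16, 18, 20],
    [21, 21, 21, 21],
    [22, 22, 23, 24],
    [25, 30, 34, 45],
]

for smoothed in smooth_by_boundaries(bins):
    print(smoothed)
```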
***GOOD LUCK DATA SCIENTISTS***
TRUE MARKS COME NOT FROM MEMORIZING ANSWERS BUT FROM MASTERING CONCEPTS.
Netflix dataset for Q# 01 & Q# 02
Series ID  Series Name          Genre        Viewer Age  Rating  Total Views  Watch Time (hrs)  Country   Series Status
001        Stranger Things      Sci-Fi       16          9.2     230          8.5               USA       Ongoing
002        Money Heist          Thriller     NaN         8.3     1,000,000    10.0              Spain     Completed
003        Queen's Gambit       Drama        23          15.0    1200         -5.0              UK        Completed
004        DARK                 Sci-Fi       18          9.0     1500         9.9               Germany   Ongoing
008        Bridgerton           Romance      NaN         7.5     350          8.5               NaN       Completed
009        Lucifer              Thriller     22          9.3     700          5.5               USA       Ongoing
010        Squid Game           Thriller     24          9.6     4500         8.8               Korea     Completed
011        Emily in Paris       Comedy       -           NaN     80           3.3               France    Ongoing
012        La Casa de Papel     Thriller     45          8.4     NaN          ?                 Spain     Completed
013        Narcos               Crime        34          8.8     5000         1.0               Colombia  Ongoing
014        Breaking Bad         Crime/Drama  NaN         9.5     750          8.0               Mexico    Completed
015        Stranger Things      Sci-Fi       16          9.2     230          8.5               USA       Ongoing
016        The Elite            Drama        20          8.2     1100         6.5               Spain     Completed
017        The Witcher          Fantasy      21          8.7     90000        10.0              Poland    Ongoing
018        The Vampire Diaries  Fantasy      22          8.7     1200         6.5               USA       Completed
019        The Crown            Historical   30          8.8     6000         4.5               UK        Completed
020        The Witcher          Adventure    21          8.7     90000        10.0              Poland    Ongoing