DA Answer-Key
Q1. In reviewing metadata that describes a dataset's variables, you notice several variables are
marked as "derived." What does this indicate about those variables? (0.5)
Q2. You compare two histograms: Histogram X is wider with more spread in data, and
Histogram Y is narrower. What can you infer about the variability of data in each histogram?
(0.5)
1. ** Histogram X shows higher variability compared to Histogram Y
2. Histogram Y shows higher variability compared to Histogram X
3. Both histograms have identical variability
4. Histogram X and Histogram Y have no data variability
Q3. Analyze how the presence of multiple duplicate entries of numerical values in a dataset
impacts the calculation of the median (0.5)
1. The median is always one of the duplicate values
2. The median is the average of the duplicate values
3. ** The median is unaffected by the duplicates
4. The median cannot be determined if duplicates are present.
Q4. Given the following data set: 4, 8, 8, 10, 12, 14, 14, 14, 18, 20, determine which of the
following statements accurately describes the data. (0.5)
1. The mean, mode and median are equal.
2. The mean is greater than the median, but mode is less than the median.
3. ** The mode is greater than both the median and the mean.
4. The mean is less than the mode, but greater than the median.
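The three measures for the Q4 data set can be checked with a short Python sketch. It uses only the standard-library statistics module; the variable names are illustrative.

```python
import statistics

# Data set from Q4.
data = [4, 8, 8, 10, 12, 14, 14, 14, 18, 20]

mean = statistics.mean(data)      # arithmetic average
median = statistics.median(data)  # average of the two middle values (n is even)
mode = statistics.mode(data)      # most frequent value

print(f"mean={mean}, median={median}, mode={mode}")
print("mode > median > mean:", mode > median > mean)
```

Here the mode (14) exceeds the median (13), which in turn exceeds the mean (12.2).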
Q5. Evaluate the following situation: You have a dataset with a kurtosis value of 0. What does
this indicate about the spread of data points in comparison to a normal distribution? (0.5)
1. ** The data is more spread out and has a flatter peak
2. The data is less widespread and has a sharper peak
3. The data is normally distributed
4. The data has the same spread as a normal distribution but with more extreme
values
Q6. You are working with a dataset that has a large standard deviation. What does this imply
about the values in the dataset relative to the mean? (0.5)
1. Most of the values are clustered closely around the mean
2. ** The values are widely spread out around the mean
3. The dataset is likely to have no outliers
4. The mean and standard deviation are equal
Q7. A researcher wants to test whether a new teaching method improves student test scores. What
would be the null hypothesis in this case? (0.5)
Q8. When conducting a hypothesis test, if the p-value is less than the chosen level of significance
(alpha), what should you do? (0.5)
Q10. A scientist wants to test whether a new drug has a different effect on blood pressure
compared to a placebo. The null hypothesis states that there is no difference in blood pressure
between the drug and placebo groups. A hypothesis test is performed at a significance level
of 0.05. Which of the following steps is appropriate when conducting the hypothesis test?
(0.5)
1. **Increases
2. Decreases
3. Remains same
4. None of the above
Type: DES
Q11. A company tests the efficiency of three different advertising strategies (Ad A, Ad B, Ad C) by
measuring the number of sales (in thousands) generated by each strategy over 4 days. (4)
Day   Ad A   Ad B   Ad C
1     30     22     25
2     35     28     30
3     40     32     35
4     42     30     40
Use one-way ANOVA to determine if there is a significant difference in the mean number of sales
across the three advertising strategies at a 5% significance level.
Solution: Calculation of group means and overall mean = 1M, Sums of Squares = 1M, Mean
Squares = 1M, F-statistic & Conclusion = 1M
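The marking steps above (group and overall means, sums of squares, mean squares, F-statistic and conclusion) can be sketched in Python as follows. This is a minimal illustration using only the standard library, not a prescribed solution. The critical value F_CRITICAL = 4.26 is an assumption taken from a standard F-table for 2 and 9 degrees of freedom at the 5% level, and all variable names are illustrative.

```python
from statistics import mean

# Sales (in thousands) per day for each advertising strategy, from the Q11 table.
groups = {
    "Ad A": [30, 35, 40, 42],
    "Ad B": [22, 28, 32, 30],
    "Ad C": [25, 30, 35, 40],
}

all_values = [x for g in groups.values() for x in g]
grand_mean = mean(all_values)
k = len(groups)            # number of groups
n_total = len(all_values)  # total number of observations

# Between-group and within-group sums of squares.
ssb = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
ssw = sum((x - mean(g)) ** 2 for g in groups.values() for x in g)

# Degrees of freedom and mean squares.
df_between = k - 1
df_within = n_total - k
msb = ssb / df_between
msw = ssw / df_within

f_stat = msb / msw

# Assumption: table value of F(0.05; 2, 9), rounded to two decimals.
F_CRITICAL = 4.26
reject_h0 = f_stat > F_CRITICAL

print(f"SSB={ssb:.2f}, SSW={ssw:.2f}, MSB={msb:.2f}, MSW={msw:.2f}, F={f_stat:.2f}")
print("Reject H0" if reject_h0 else "Fail to reject H0")
```

Under these assumptions F is about 2.57, which is below the critical value, so the null hypothesis of equal mean sales across the three strategies is not rejected at the 5% level.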
Q12. Given a dataset with the following values:
Values=[5,15,25,35,45,55,65,75,85,95]
a. Explain the need for data transformation in data analytics and how Min-Max
Normalization and Decimal Scaling help in preparing data for analysis.
b. Apply Min-Max Normalization to the dataset to transform the values to a range of [0, 1].
Show your calculations and results.
c. Apply Decimal Scaling to the dataset using a scaling factor of 100. Show your
calculations and results. (3)
Solution: a) Explanation of normalization, min-max and decimal scaling (1M), b) Min-max
normalization (1M) and c) Decimal scaling (1M).
(a) Normalization uses a mathematical function to transform numeric columns to a new range.
Normalization is important in preventing certain data analysis methods from giving some
variables undue influence over others because of differences in the range of their values.
The min–max transformation maps the values of a variable to a new range, such as from 0 to
1. The decimal scaling transformation moves the decimal point so that the transformed values
fall between −1 and 1.
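b) x' = (x − min) / (max − min) = (x − 5) / (95 − 5), rounded to two decimal places
Original Value Transformed Value
5 0.00
15 0.11
25 0.22
35 0.33
45 0.44
55 0.56
65 0.67
75 0.78
85 0.89
95 1.00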
c)
Original Value Transformed Value
5 0.05
15 0.15
25 0.25
35 0.35
45 0.45
55 0.55
65 0.65
75 0.75
85 0.85
95 0.95
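Both transformations in Q12 can be reproduced with a small Python sketch. The function names min_max_normalize and decimal_scale are illustrative helpers, not part of any library, and the min-max results are rounded to two decimal places for display.

```python
values = [5, 15, 25, 35, 45, 55, 65, 75, 85, 95]


def min_max_normalize(xs, new_min=0.0, new_max=1.0):
    """Map xs linearly so that min(xs) -> new_min and max(xs) -> new_max."""
    lo, hi = min(xs), max(xs)
    return [new_min + (x - lo) * (new_max - new_min) / (hi - lo) for x in xs]


def decimal_scale(xs, factor=100):
    """Divide every value by a power of ten (here the factor given in Q12)."""
    return [x / factor for x in xs]


min_max = [round(v, 2) for v in min_max_normalize(values)]
scaled = decimal_scale(values)

for original, mm, ds in zip(values, min_max, scaled):
    print(f"{original:>3}  min-max={mm:.2f}  decimal-scaled={ds:.2f}")
```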
Q13. You are analyzing a dataset of customer transaction amounts for a retail company. The
dataset contains the following transaction values:
Values=[10,12,14,18,22,24,25,28,30,50]
a. As part of your analysis, you need to evaluate the role of the Interquartile Range (IQR) in
identifying outliers. Calculate the IQR for this dataset and determine if there are any
outliers.
b. Analyze how identifying and addressing outliers using the IQR can affect the overall
quality of your analysis and the insights you can derive about customer spending
behavior. (3)
Solution
Calculation of Q1 and Q3 (0.5M each) 1M. Upper Bound (0.5M) Lower Bound (0.5M) Analysis
(1M)
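a)
Q1 (First Quartile)
The 25th percentile
Lower half: [10, 12, 14, 18, 22]
Median = 14
Q3 (Third Quartile)
The 75th percentile
Upper half: [24, 25, 28, 30, 50]
Median = 28
IQR = Q3 − Q1
IQR = 28 − 14 = 14
Determine any outliers
Lower bound = Q1 − 1.5 × IQR
= 14 − 1.5 × 14
= −7
Upper bound = Q3 + 1.5 × IQR
= 28 + 1.5 × 14
= 49
The value 50 lies above the upper bound of 49, so it is an outlier. No value lies below −7.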
b) Outliers can inflate or deflate the mean transaction value and exaggerate the apparent
variability in customer spending patterns. Identifying and addressing outliers using the IQR
improves the quality of the analysis by preventing distortion in key metrics such as the mean
and variance, leading to more accurate insights about typical customer behavior.
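The IQR check for Q13 can be sketched in Python as below. This assumes the median-of-halves convention for quartiles, matching the upper-half calculation in the solution. Other quartile conventions (for example, interpolated percentiles) can give slightly different bounds. The helper name iqr_bounds is illustrative.

```python
from statistics import median

values = [10, 12, 14, 18, 22, 24, 25, 28, 30, 50]


def iqr_bounds(xs):
    """Return (q1, q3, iqr, lower, upper) using the median-of-halves method."""
    s = sorted(xs)
    half = len(s) // 2
    lower_half = s[:half]    # values below the median position
    upper_half = s[-half:]   # values above the median position
    q1 = median(lower_half)
    q3 = median(upper_half)
    iqr = q3 - q1
    return q1, q3, iqr, q1 - 1.5 * iqr, q3 + 1.5 * iqr


q1, q3, iqr, lower, upper = iqr_bounds(values)
outliers = [x for x in values if x < lower or x > upper]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}, bounds=({lower}, {upper}), outliers={outliers}")
```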
Q14. Consider a dataset containing the following information about a set of customers:
• Age
• Annual Income
• Spending Score (a measure of customer behaviour)
Using this dataset, perform an Exploratory Data Analysis to answer the following:
a. Identify the basic summary statistics (mean, median, and mode) for the Age and Annual
Income columns.
b. Identify any outliers in the Annual Income column using the Interquartile Range (IQR)
method.
c. Interpret the relationship between Age and Spending Score using a scatter plot or
correlation analysis. (3)
1. Basic Summary Statistics (Age and Annual Income) for the considered sample dataset
For each column, calculate:
• Mean: Average value.
• Median: Middle value.
• Mode: Most frequent value.
Q15. A company is analyzing the distribution of delivery times for their products. After collecting
data, they notice that most deliveries happen around 2-3 days, but a few deliveries take much longer
due to unexpected delays.
a. Explain how skewness affects the distribution of delivery times and its impact on the mean,
median, and mode of the dataset. (3)
Solution:
• 1 mark for explaining skewness and identifying it as positive skewness.
• 1 mark for describing the impact on the mean (being higher due to outliers).
• 1 mark for explaining the impact on the median and mode, with the median being less
affected and the mode representing the most frequent value.
• Skewness: The distribution of delivery times is positively skewed since most deliveries
happen within a short time (2-3 days), but a few take much longer, pulling the tail to the
right.
• Impact on Mean, Median, and Mode:
o Mean: Since the data is positively skewed, the mean will be higher than the median because
the longer delivery times (outliers) increase the average.
o Median: The median, being the middle value, is less affected by the outliers and will be
closer to the bulk of the data (around 2-3 days).
o Mode: The mode will represent the most frequent delivery time, likely around 2-3 days,
unaffected by the skew.
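The effect described above can be illustrated with a small Python sketch. The delivery times below are hypothetical values invented for illustration (most deliveries at 2-3 days, a few long delays); they are not data from the question.

```python
import statistics

# Hypothetical delivery times in days: mostly 2-3 days, with a few long delays.
delivery_days = [2, 2, 2, 3, 3, 3, 3, 4, 9, 14]

mean = statistics.mean(delivery_days)      # pulled upward by the long delays
median = statistics.median(delivery_days)  # stays near the bulk of the data
mode = statistics.mode(delivery_days)      # most frequent delivery time

print(f"mean={mean}, median={median}, mode={mode}")
```

With this sample the mean (4.5 days) sits above both the median and the mode (3 days each), which is the pattern expected for a positively skewed distribution.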
Q16. A research study aimed to assess the effectiveness of a six-month high-intensity interval
training (HIIT) program in lowering heart rates. For adults in China, the average heart rate is
typically 72 beats per minute. After participating in HIIT, a sample of 25 individuals recorded
an average heart rate of 69 beats per minute, with a standard deviation of 6.5 beats per
minute. Using a statistical test, determine if there is significant evidence to suggest that the
HIIT successfully reduced heart rates. (3)
H0: μ = 72
HA: μ < 72
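Because the population standard deviation is unknown and the sample is small (n = 25), a one-sample, one-tailed t-test is a natural choice for Q16. The sketch below computes the test statistic from the figures in the question. The question does not state a significance level, so α = 0.05 is an assumption here, and T_CRITICAL = −1.711 is the corresponding t-table value for 24 degrees of freedom.

```python
import math

mu0 = 72.0          # population mean under H0 (beats per minute)
n = 25              # sample size
sample_mean = 69.0  # sample mean after the HIIT program
sample_sd = 6.5     # sample standard deviation

standard_error = sample_sd / math.sqrt(n)
t_stat = (sample_mean - mu0) / standard_error
df = n - 1

# Assumption: alpha = 0.05, one-tailed (left), df = 24, from a standard t-table.
T_CRITICAL = -1.711
reject_h0 = t_stat < T_CRITICAL

print(f"t = {t_stat:.3f}, df = {df}")
print("Reject H0" if reject_h0 else "Fail to reject H0")
```

Under these assumptions t is about −2.31, which is below −1.711, so H0 is rejected at the 5% level and the data support a reduction in mean heart rate.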
Q17. A wildlife biologist is studying the alertness levels (arousal) of a population of "chill penguins"
living in a tropical zoo. The arousal levels in this population are normally distributed, with a known
standard deviation of 6. The biologist collects a sample of 49 "chill penguins" and measures their
arousal, finding a sample mean arousal level of 46.44 and a sample standard deviation of 5.6968.
Under normal conditions, the expected arousal level of these penguins is 47. Using a significance
level of α = 0.01, test whether the observed sample mean of 46.44 is significantly less than the
expected population mean of 47.
State the Hypotheses
• Null Hypothesis (H0): H0:μ=47
• Alternative Hypothesis (HA): HA:μ<47
This is a one-tailed test.
z-statistic: z = (46.44 − 47) / (6 / √49) = −0.653
The critical value for a one-tailed z-test at α = 0.01 is approximately −2.33.
Since −0.653 > −2.33, we fail to reject the null hypothesis.
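The z-test for Q17 can be verified with a short Python sketch using statistics.NormalDist from the standard library. The known population standard deviation (6) is used rather than the sample standard deviation, as a z-test requires. Variable names are illustrative.

```python
import math
from statistics import NormalDist

mu0 = 47.0           # expected population mean under H0
sigma = 6.0          # known population standard deviation
n = 49               # sample size
sample_mean = 46.44  # observed sample mean
alpha = 0.01         # significance level

z_stat = (sample_mean - mu0) / (sigma / math.sqrt(n))

standard_normal = NormalDist()                # mean 0, standard deviation 1
z_critical = standard_normal.inv_cdf(alpha)   # left-tail critical value
p_value = standard_normal.cdf(z_stat)         # left-tail p-value

reject_h0 = z_stat < z_critical

print(f"z = {z_stat:.3f}, critical = {z_critical:.3f}, p = {p_value:.3f}")
print("Reject H0" if reject_h0 else "Fail to reject H0")
```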
Scheme:
Does the mean value for patients differ significantly from the mean value of the general
population (12 gm/dl)? Evaluate the role of chance. (α = 0.05)
Scheme
Q19. A company is evaluating the impact of two distinct advertising strategies (Strategy A and
Strategy B) across three regions (Region 1, Region 2, and Region 3) to understand how they influence
sales. The marketing team gathers sales performance data after applying both strategies in all three
regions.
1. Based on the given scenario, what is the appropriate statistical technique the company
should use to determine if there is a significant effect of the advertising strategy, region, or
their interaction on sales?
2. After selecting the appropriate statistical technique, what kinds of conclusions could the
company expect from the analysis of the data?