
DA Answer-Key

The document consists of multiple-choice questions (MCQs) and descriptive questions (DES) related to data analysis, statistics, and hypothesis testing. It covers topics such as derived variables, variability in histograms, median calculations, kurtosis, standard deviation implications, hypothesis testing, chi-square tests, outlier detection using IQR, exploratory data analysis, skewness effects, and statistical tests for evaluating the effectiveness of interventions. The questions require calculations, explanations, and interpretations of statistical concepts and methods.

Uploaded by

khushpatel1222
Copyright
© All Rights Reserved

Type: MCQ

Q1. In reviewing metadata that describes a dataset's variables, you notice several variables are
marked as "derived." What does this indicate about those variables? (0.5)

1. ** They are calculated from other variables in the dataset
2. They are collected directly from external sources
3. They are manually entered by the data analyst
4. They are irrelevant to the data analysis

Q2. You compare two histograms: Histogram X is wider with more spread in data, and
Histogram Y is narrower. What can you infer about the variability of data in each histogram?
(0.5)
1. ** Histogram X shows higher variability compared to Histogram Y
2. Histogram Y shows higher variability compared to Histogram X
3. Both histograms have identical variability
4. Histogram X and Histogram Y have no data variability

Q3. Analyze how the presence of multiple duplicate entries of numerical values in a dataset
impacts the calculation of the median (0.5)
1. The median is always one of the duplicate values
2. The median is the average of the duplicate values
3. ** The median is unaffected by the duplicates
4. The median cannot be determined if duplicates are present.

Q4. Given the following data set: 4, 8, 8, 10, 12, 14, 14, 14, 18, 20, determine which of the
following statements accurately describes the data. (0.5)
1. The mean, mode and median are equal.
2. The mean is greater than the median, but mode is less than the median.
3. ** The mode is greater than both the median and the mean.
4. The mean is less than the mode, but greater than the median.
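
The marked answer for Q4 can be checked with a few lines of Python. This is only a sketch using the standard `statistics` module; the variable names are illustrative.

```python
from statistics import mean, median, mode

# Data set from Q4.
data = [4, 8, 8, 10, 12, 14, 14, 14, 18, 20]

data_mean = mean(data)      # sum is 122 over 10 values
data_median = median(data)  # average of the 5th and 6th sorted values (12 and 14)
data_mode = mode(data)      # 14 occurs three times

print(f"mean={data_mean}, median={data_median}, mode={data_mode}")
print("mode > median > mean:", data_mode > data_median > data_mean)
```

The mean is 12.2, the median is 13 and the mode is 14, so the mode is greater than both the median and the mean.
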
Q5. Evaluate the following situation: You have a dataset with a kurtosis value of 0. What does
this indicate about the spread of data points in comparison to a normal distribution? (0.5)
1. ** The data is more spread out and has a flatter peak
2. The data is less widespread and has a sharper peak
3. The data is normally distributed
4. The data has the same spread as a normal distribution but with more extreme
values
Q6. You are working with a dataset that has a large standard deviation. What does this imply
about the values in the dataset relative to the mean? (0.5)
1. Most of the values are clustered closely around the mean
2. ** The values are widely spread out around the mean
3. The dataset is likely to have no outliers
4. The mean and standard deviation are equal
Q7. A researcher wants to test whether a new teaching method improves student test scores. What
would be the null hypothesis in this case? (0.5)

1. The new teaching method has less effect on test scores.
2. The new teaching method improves test scores.
3. The old teaching method is better than the new one.
4. ** The test scores are not related to the teaching method.

Q8. When conducting a hypothesis test, if the p-value is less than the chosen level of significance
(alpha), what should you do? (0.5)

1. Fail to accept the null hypothesis.
2. Accept the null hypothesis.
3. ** Reject the null hypothesis.
4. Modify the null hypothesis.
Q9. A researcher is conducting a chi-square test to determine whether there is a significant
association between gender (male, female) and preference for a new product (prefer, do not
prefer). The data is summarized in a contingency table and the chi-square test is performed.
Which of the following steps should the researcher take to apply the chi-square test correctly?
(0.5)

1. Calculate the mean and standard deviation of the data.
2. ** Compare the observed frequencies to the expected frequencies under the
assumption of no association.
3. Ensure that the sample size is greater than 30 before applying the chi-square
test.
4. Use the chi-square test only if the data is normally distributed.

Q10. A scientist wants to test whether a new drug has a different effect on blood pressure
compared to a placebo. The null hypothesis states that there is no difference in blood pressure
between the drug and placebo groups. A hypothesis test is performed at a significance level
of 0.05. Which of the following steps is appropriate when conducting the hypothesis test?
(0.5)

1. **Increases
2. Decreases
3. Remains same
4. None of the above
Type: DES

Q11. A company tests the efficiency of three different advertising strategies (Ad A, Ad B, Ad C) by
measuring the number of sales (in thousands) generated by each strategy over 4 days. (4)

Day   Ad A   Ad B   Ad C
1     30     22     25
2     35     28     30
3     40     32     35
4     42     30     40

Use one-way ANOVA to determine if there is a significant difference in the mean number of sales
across the three advertising strategies at a 5% significance level.
Solution: Calculation of group means and overall mean = 1M, Sum of Squares = 1M, Mean
Squares = 1M, F-statistic & Conclusion = 1M
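
The marking scheme lists the steps without working them. The following is a minimal sketch of those steps for the Q11 data, written in plain Python. The variable names are illustrative, and the critical value 4.26 is the standard table value for F(2, 9) at α = 0.05, hard-coded here rather than computed.

```python
# One-way ANOVA by hand for the Q11 sales data (thousands of sales per day).
groups = {
    "Ad A": [30, 35, 40, 42],
    "Ad B": [22, 28, 32, 30],
    "Ad C": [25, 30, 35, 40],
}

all_values = [x for values in groups.values() for x in values]
grand_mean = sum(all_values) / len(all_values)
group_means = {name: sum(v) / len(v) for name, v in groups.items()}

# Between-group sum of squares: spread of the group means around the grand mean.
ss_between = sum(
    len(v) * (group_means[name] - grand_mean) ** 2 for name, v in groups.items()
)
# Within-group sum of squares: spread of each value around its own group mean.
ss_within = sum(
    (x - group_means[name]) ** 2 for name, v in groups.items() for x in v
)

df_between = len(groups) - 1               # k - 1 = 2
df_within = len(all_values) - len(groups)  # N - k = 9

ms_between = ss_between / df_between
ms_within = ss_within / df_within
f_stat = ms_between / ms_within

F_CRITICAL = 4.26  # table value for F(2, 9) at alpha = 0.05 (assumed, not computed)
reject_null = f_stat > F_CRITICAL

print(f"group means: {group_means}, grand mean: {grand_mean:.2f}")
print(f"SSB={ss_between:.2f}, SSW={ss_within:.2f}")
print(f"MSB={ms_between:.2f}, MSW={ms_within:.2f}, F={f_stat:.2f}")
print("Reject H0" if reject_null else "Fail to reject H0")
```

With these data the sketch gives F ≈ 2.57, which is below 4.26, so the null hypothesis of equal mean sales across the three strategies is not rejected at the 5% level.
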
Q12. Given the dataset with following values
Values=[5,15,25,35,45,55,65,75,85,95]

a. Explain the need for data transformation in data analytics and how Min-Max
Normalization and Decimal Scaling help in preparing data for analysis.
b. Apply Min-Max Normalization to the dataset to transform the values to a range of [0, 1].
Show your calculations and results.
c. Apply Decimal Scaling to the dataset using a scaling factor of 100. Show your
calculations and results. (3)
Solution: Explanation on Normalization, min-max and decimal scaling (1M), b) Min max-
normalization (1M) and c) Decimal Scaling (1M).

(a) Normalization uses a mathematical function to transform numeric columns to a new range.
Normalization is important in preventing certain data analysis methods from giving some
variables undue influence over others because of differences in the range of their values.

The min–max transformation maps the values of a variable to a new range, such as from 0 to
1, using x' = (x − min) / (max − min). The decimal scaling transformation moves the decimal
point (divides by a power of 10) so that all values lie between −1 and 1.

b) With min = 5 and max = 95, each value is transformed as (x − 5) / 90, rounded to two decimals:

Original Value   Transformed Value
5                0
15               0.11
25               0.22
35               0.33
45               0.44
55               0.56
65               0.67
75               0.78
85               0.89
95               1

c) Each value is divided by the scaling factor of 100:

Original Value   Transformed Value
5                0.05
15               0.15
25               0.25
35               0.35
45               0.45
55               0.55
65               0.65
75               0.75
85               0.85
95               0.95
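
Both transformations in Q12 can be reproduced with a short sketch. The variable names are illustrative, and the min-max results are rounded to two decimals for display.

```python
# Q12 dataset.
values = [5, 15, 25, 35, 45, 55, 65, 75, 85, 95]

# Min-max normalization to [0, 1]: x' = (x - min) / (max - min).
lowest, highest = min(values), max(values)
min_max = [round((x - lowest) / (highest - lowest), 2) for x in values]

# Decimal scaling with the scaling factor of 100 given in the question.
SCALING_FACTOR = 100
decimal_scaled = [x / SCALING_FACTOR for x in values]

for original, mm, ds in zip(values, min_max, decimal_scaled):
    print(f"{original:>3}  min-max={mm:<5} decimal-scaled={ds}")
```
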

Q13. You are analyzing a dataset of customer transaction amounts for a retail company. The
dataset contains the following transaction values:

Values=[10,12,14,18,22,24,25,28,30,50]

a. As part of your analysis, you need to evaluate the role of the Interquartile Range (IQR) in
identifying outliers. Calculate the IQR for this dataset and determine if there are any
outliers.
b. Analyze how identifying and addressing outliers using the IQR can impact on the overall
quality of your analysis and the insights you can derive about customer spending
behavior. (3)

Solution

Calculation of Q1 and Q3 (0.5M each): 1M. Upper bound (0.5M), lower bound (0.5M), analysis
(1M)

a) Arrange the data in ascending order:

Values = 10, 12, 14, 18, 22, 24, 25, 28, 30, 50

Q1 (first quartile, the 25th percentile) is the median of the lower half [10, 12, 14, 18, 22]:
Q1 = 14

Q3 (third quartile, the 75th percentile) is the median of the upper half [24, 25, 28, 30, 50]:
Q3 = 28

IQR = Q3 − Q1 = 28 − 14 = 14

Determine any outliers:
Lower bound = Q1 − 1.5 × IQR = 14 − 1.5 × 14 = 14 − 21 = −7
Upper bound = Q3 + 1.5 × IQR = 28 + 1.5 × 14 = 28 + 21 = 49

Values outside the range [−7, 49] are considered outliers.

In this dataset, 50 is the only outlier.
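
The same calculation can be sketched in Python. This version follows the median-of-halves convention used in the solution above; library routines that interpolate percentiles may return slightly different quartiles. The variable names are illustrative.

```python
from statistics import median

# Q13 transaction values.
values = sorted([10, 12, 14, 18, 22, 24, 25, 28, 30, 50])
n = len(values)

# Median-of-halves quartiles (the middle value is excluded when n is odd).
lower_half = values[: n // 2]
upper_half = values[(n + 1) // 2 :]
q1 = median(lower_half)
q3 = median(upper_half)
iqr = q3 - q1

lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = [x for x in values if x < lower_bound or x > upper_bound]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"bounds=[{lower_bound}, {upper_bound}], outliers={outliers}")
```
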

b) Including outliers can inflate or deflate the mean transaction value and overstate the variability in
customer spending patterns. Identifying and addressing outliers using the IQR improves the
quality of the analysis by preventing distortion in key metrics such as the mean and variance, leading to
more accurate insights about typical customer behavior.

Q14. Consider a dataset containing the following information about a set of customers:

• Age
• Annual Income
• Spending Score (a measure of customer behaviour)
Using this dataset, perform an Exploratory Data Analysis to answer the following:
a. Identify the basic summary statistics (mean, median, and mode) for the Age and Annual
Income columns.
b. Identify any outliers in the Annual Income column using the Interquartile Range (IQR)
method.
c. Interpret the relationship between Age and Spending Score using a scatter plot or
correlation analysis. (3)

Solution: Basic summary = 1M, Outliers =1M, Relationship analysis =1M

1. Basic Summary Statistics (Age and Annual Income) for the considered sample dataset
For each column, calculate:
• Mean: Average value.
• Median: Middle value.
• Mode: Most frequent value.

2. Outliers in Annual Income (IQR Method)
The IQR method to detect outliers involves:
1. Calculate the first (Q1) and third (Q3) quartiles of the Annual Income data.
o Q1 = value calculated from the considered data
o Q3 = value calculated from the considered data
2. IQR = Q3 - Q1
3. Identify the lower and upper bounds for outliers using:
o Lower bound = Q1 - 1.5 * IQR
o Upper bound = Q3 + 1.5 * IQR
o Any value outside these bounds in the considered dataset is an outlier.
3. Relationship Between Age and Spending Score
To assess the relationship:
• Scatter Plot: Plot Age on the x-axis and Spending Score on the y-axis. If the points (as per the
considered data) show an upward or downward trend, a positive or negative correlation,
respectively, can be concluded.
OR
• Correlation Analysis: Calculate the correlation coefficient and derive the conclusion (which
will depend on the considered dataset).
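
Because Q14 supplies no actual data, the three steps can only be illustrated on made-up values. The sketch below uses a small hypothetical customer sample (every number is an assumption chosen for illustration) and the helper names `summarize`, `iqr_outliers`, and `pearson_r` are likewise illustrative.

```python
from statistics import mean, median, mode

# Hypothetical sample data; the question does not provide real values.
ages = [22, 25, 25, 31, 38, 44, 52, 60]
incomes = [30, 32, 35, 35, 40, 42, 45, 120]  # annual income, in thousands
scores = [85, 80, 78, 65, 55, 48, 35, 30]    # spending score


def summarize(column):
    """Return the mean, median and mode of a numeric column."""
    return {"mean": mean(column), "median": median(column), "mode": mode(column)}


def iqr_outliers(column):
    """Flag values outside Q1 - 1.5*IQR and Q3 + 1.5*IQR (median-of-halves quartiles)."""
    data = sorted(column)
    n = len(data)
    q1 = median(data[: n // 2])
    q3 = median(data[(n + 1) // 2 :])
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < low or x > high]


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length columns."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


age_summary = summarize(ages)
income_summary = summarize(incomes)
income_outliers = iqr_outliers(incomes)
r_age_score = pearson_r(ages, scores)

print("Age:", age_summary)
print("Annual Income:", income_summary)
print("Income outliers:", income_outliers)
print(f"Correlation(Age, Spending Score) = {r_age_score:.2f}")
```

In this hypothetical sample the income of 120 is flagged as an outlier and the correlation is strongly negative, meaning older customers have lower spending scores. A real dataset may of course lead to different conclusions.
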

Q15. A company is analyzing the distribution of delivery times for their products. After collecting
data, they notice that most deliveries happen around 2-3 days, but a few deliveries take much longer
due to unexpected delays.

a. Explain how skewness affects the distribution of delivery times and its impact on the mean,
median, and mode of the dataset. (3)

Solution:
• 1 mark for explaining skewness and identifying it as positive skewness.
• 1 mark for describing the impact on the mean (being higher due to outliers).
• 1 mark for explaining the impact on the median and mode, with the median being less
affected and the mode representing the most frequent value.
• Skewness: The distribution of delivery times is positively skewed since most deliveries
happen within a short time (2-3 days), but a few take much longer, pulling the tail to the
right.
• Impact on Mean, Median, and Mode:
o Mean: Since the data is positively skewed, the mean will be higher than the median because
the longer delivery times (outliers) increase the average.
o Median: The median, being the middle value, is less affected by the outliers and will be
closer to the bulk of the data (around 2-3 days).
o Mode: The mode will represent the most frequent delivery time, likely around 2-3 days,
unaffected by the skew.
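
A tiny numerical illustration of this ordering, using hypothetical delivery times (the values are assumptions, chosen so that most deliveries take 2-3 days and two are badly delayed):

```python
from statistics import mean, median, mode

# Hypothetical delivery times in days: mostly 2-3 days, with two long delays.
delivery_days = [2, 2, 2, 3, 3, 2, 3, 2, 9, 12]

avg_days = mean(delivery_days)       # pulled upward by the delays of 9 and 12 days
median_days = median(delivery_days)  # stays near the bulk of the data
mode_days = mode(delivery_days)      # most frequent delivery time

print(f"mean={avg_days}, median={median_days}, mode={mode_days}")
print("positively skewed ordering (mode < median < mean):",
      mode_days < median_days < avg_days)
```

Here the mean is 4 days while the median is 2.5 and the mode is 2, which is the mode < median < mean pattern described for a positively skewed distribution.
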

Q16. A research study aimed to assess the effectiveness of a six-month high-intensity interval
training (HIIT) program in lowering heart rates. For adults in China, the average heart rate is
typically 72 beats per minute. After participating in HIIT, a sample of 25 individuals recorded
an average heart rate of 69 beats per minute, with a standard deviation of 6.5 beats per
minute. Using a statistical test, determine if there is significant evidence to suggest that the
HIIT successfully reduced heart rates
H0: μ = 72

HA: μ < 72

Where

x̄ = 69 (sample mean)

μ0 = 72 (population mean under the null hypothesis)

s = 6.5 (sample standard deviation)

n = 25 (sample size)

The test statistic is t = (x̄ − μ0) / (s / √n) = (69 − 72) / (6.5 / √25) = −3 / 1.3 ≈ −2.31.

(3)
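
The t-statistic for Q16 can be reproduced as follows. The question does not state a significance level, so the decision step in this sketch assumes α = 0.05; the one-tailed critical value −1.711 for 24 degrees of freedom is taken from a standard t-table and hard-coded.

```python
import math

# Values given in Q16.
sample_mean = 69.0
mu0 = 72.0   # population mean under H0
s = 6.5      # sample standard deviation
n = 25       # sample size

standard_error = s / math.sqrt(n)
t_stat = (sample_mean - mu0) / standard_error

# Assumption: alpha = 0.05, one-tailed (lower tail), df = n - 1 = 24.
T_CRITICAL = -1.711
reject_null = t_stat < T_CRITICAL

print(f"SE={standard_error:.2f}, t={t_stat:.2f}")
print("Reject H0" if reject_null else "Fail to reject H0")
```

Under that assumed 5% level, t ≈ −2.31 falls below −1.711, so the null hypothesis is rejected and the data suggest that HIIT lowered the average heart rate.
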
Q17. A wildlife biologist is studying the alertness levels (arousal) of a population of "chill penguins"
living in a tropical zoo. The arousal levels in this population are normally distributed, with a known
standard deviation of 6. The biologist collects a sample of 49 "chill penguins" and measures their
arousal, finding a sample mean arousal level of 46.44 and a sample standard deviation of 5.6968.
Under normal conditions, the expected arousal level of these penguins is 47. Using a significance
level of α = 0.01, test whether the observed sample mean of 46.44 is significantly less than the
expected population mean of 47.
State the Hypotheses
• Null Hypothesis (H0): μ = 47
• Alternative Hypothesis (HA): μ < 47
This is a one-tailed test.
Because the population standard deviation is known (σ = 6), a z-test is used:
z = (x̄ − μ0) / (σ / √n) = (46.44 − 47) / (6 / √49) ≈ −0.653
The critical value for a one-tailed z-test at α = 0.01 is approximately −2.33.
Since −0.653 > −2.33, we fail to reject the null hypothesis.
Scheme:

State Hypothesis: 0.5 Mark

Calculation of z statistic: 0.5 Mark

Find critical value: 0.5 Mark

Decision about accept and Reject Hypothesis: 0.5 Mark

(2)
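
The z-test for Q17 can be checked with the standard library's `statistics.NormalDist`, which supplies the critical value instead of a table lookup. The variable names are illustrative.

```python
import math
from statistics import NormalDist

# Values given in Q17.
sample_mean = 46.44
mu0 = 47.0    # expected arousal level under H0
sigma = 6.0   # known population standard deviation
n = 49
alpha = 0.01

z_stat = (sample_mean - mu0) / (sigma / math.sqrt(n))

# Lower-tail critical value and p-value from the standard normal distribution.
z_critical = NormalDist().inv_cdf(alpha)
p_value = NormalDist().cdf(z_stat)
reject_null = z_stat < z_critical

print(f"z={z_stat:.3f}, critical={z_critical:.2f}, p={p_value:.3f}")
print("Reject H0" if reject_null else "Fail to reject H0")
```

The known population standard deviation (6) is used rather than the sample standard deviation (5.6968), which is why a z-test applies here.
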
Q18. The following data represents hemoglobin values in gm/dl for 10 patients:

10.5 9 6.5 8 11 7 7.5 8.5 9.5 12

Does the mean value for these patients differ significantly from the mean value of the general population (12
gm/dl)? Evaluate the role of chance. (α = 0.05)

Scheme

Calculation of mean and SD: 0.5 Mark

Calculation of Standard Error: 0.5 Mark

Calculation of t value: 0.5 Mark

Decision about accept and Reject Hypothesis: 0.5 Mark

(2)
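
The scheme for Q18 lists the steps only, so here is a minimal sketch that carries them out as a two-tailed one-sample t-test. The critical value 2.262 (α = 0.05, 9 degrees of freedom) is a standard t-table value, hard-coded here as an assumption rather than computed.

```python
import math
from statistics import mean, stdev

# Hemoglobin values (gm/dl) for the 10 patients in Q18.
hemoglobin = [10.5, 9, 6.5, 8, 11, 7, 7.5, 8.5, 9.5, 12]
mu0 = 12.0  # general population mean
n = len(hemoglobin)

sample_mean = mean(hemoglobin)
sample_sd = stdev(hemoglobin)  # sample SD, using n - 1 in the denominator
standard_error = sample_sd / math.sqrt(n)
t_stat = (sample_mean - mu0) / standard_error

T_CRITICAL = 2.262  # two-tailed, alpha = 0.05, df = 9 (table value)
reject_null = abs(t_stat) > T_CRITICAL

print(f"mean={sample_mean:.2f}, SD={sample_sd:.2f}, SE={standard_error:.2f}")
print(f"t={t_stat:.2f}")
print("Reject H0" if reject_null else "Fail to reject H0")
```

This gives a mean of 8.95, SD ≈ 1.80, SE ≈ 0.57 and t ≈ −5.35. Since |t| exceeds 2.262, the difference from 12 gm/dl is unlikely to be due to chance at the 5% level.
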

Q19. A company is evaluating the impact of two distinct advertising strategies (Strategy A and
Strategy B) across three regions (Region 1, Region 2, and Region 3) to understand how they influence
sales. The marketing team gathers sales performance data after applying both strategies in all three
regions.
1. Based on the given scenario, what is the appropriate statistical technique the company
should use to determine if there is a significant effect of the advertising strategy, region, or
their interaction on sales?
2. After selecting the appropriate statistical technique, what kinds of conclusions could the
company expect from the analysis of the data?

1. Appropriate Statistical Technique: Two-Way ANOVA


In this scenario, the company is evaluating the impact of two factors (advertising strategy and
region) on sales (a continuous dependent variable). Since there are two independent variables
(advertising strategy and region) and the goal is to see if either of these factors, or their interaction,
significantly influences sales, the appropriate statistical technique to use is a Two-Way ANOVA
(Analysis of Variance).
• Factor 1: Advertising strategy (Strategy A vs. Strategy B)
• Factor 2: Region (Region 1, Region 2, Region 3)
This technique allows the company to assess:
1. The main effect of advertising strategy (A vs. B) on sales.
2. The main effect of region (Region 1, 2, 3) on sales.
3. The interaction effect between advertising strategy and region (i.e., whether the effect of
the advertising strategy depends on the region).
2. Expected Conclusions from the Analysis
After performing a Two-Way ANOVA, the company could draw the following types of conclusions:
• Main effect of advertising strategy: The company will determine whether there is a
significant difference in sales between Strategy A and Strategy B, irrespective of the region.
For example, they may conclude that one strategy consistently leads to higher sales across
all regions.
• Main effect of region: The company will assess whether there are significant differences in
sales between the regions, regardless of the advertising strategy used. This could help them
understand if certain regions generally perform better in terms of sales.
• Interaction effect: The analysis will show whether the effectiveness of an advertising
strategy varies by region. This would suggest that the best advertising strategy might depend
on the region. For example, Strategy A might perform better in Region 1 but worse in Region
3, indicating a need for a tailored approach.
Possible conclusions from the Two-Way ANOVA:
• If there’s no significant interaction but significant main effects: The company might
conclude that one strategy is generally better and that certain regions perform better
regardless of the strategy used.
• If there’s a significant interaction effect: The company would likely conclude that the best
advertising strategy depends on the specific region, and a "one size fits all" approach may
not work.
• If neither main effects nor interaction effects are significant: The company might conclude
that neither the advertising strategy nor the region has a significant impact on sales, and
other factors should be investigated.
Scheme
Appropriate statistical technique: 1 Mark
Expected conclusions: 1 Mark
(2)
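
To make the two main effects and the interaction effect concrete, the sketch below partitions the sums of squares for a balanced two-way design. Q19 gives no data, so the sales figures are entirely hypothetical (two observations per strategy-region cell), and the names are illustrative.

```python
from statistics import mean

# Hypothetical sales: sales[strategy][region] -> replicate observations per cell.
sales = {
    "A": {"R1": [20, 22], "R2": [25, 27], "R3": [30, 28]},
    "B": {"R1": [18, 16], "R2": [26, 24], "R3": [35, 37]},
}
strategies = list(sales)
regions = list(sales["A"])
reps = len(sales["A"]["R1"])  # balanced design: same count in every cell

all_obs = [x for s in strategies for r in regions for x in sales[s][r]]
grand = mean(all_obs)
strategy_mean = {s: mean([x for r in regions for x in sales[s][r]]) for s in strategies}
region_mean = {r: mean([x for s in strategies for x in sales[s][r]]) for r in regions}
cell_mean = {(s, r): mean(sales[s][r]) for s in strategies for r in regions}

# Partition the total variation into strategy, region, interaction and error.
ss_strategy = reps * len(regions) * sum((strategy_mean[s] - grand) ** 2 for s in strategies)
ss_region = reps * len(strategies) * sum((region_mean[r] - grand) ** 2 for r in regions)
ss_interaction = reps * sum(
    (cell_mean[s, r] - strategy_mean[s] - region_mean[r] + grand) ** 2
    for s in strategies for r in regions
)
ss_error = sum(
    (x - cell_mean[s, r]) ** 2 for s in strategies for r in regions for x in sales[s][r]
)
ss_total = sum((x - grand) ** 2 for x in all_obs)

df_strategy = len(strategies) - 1
df_region = len(regions) - 1
df_interaction = df_strategy * df_region
df_error = len(all_obs) - len(strategies) * len(regions)

ms_error = ss_error / df_error
f_strategy = (ss_strategy / df_strategy) / ms_error
f_region = (ss_region / df_region) / ms_error
f_interaction = (ss_interaction / df_interaction) / ms_error

print(f"F(strategy)={f_strategy:.2f}, F(region)={f_region:.2f}, "
      f"F(interaction)={f_interaction:.2f}")
```

Each F value would then be compared with the critical F for its degrees of freedom (or converted to a p-value) to decide which of the three effects is significant.
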
