QUIZ1 Solution

The document consists of a series of multiple-choice questions and case study questions related to data analysis, regression models, and business analytics. It covers topics such as box plots, histograms, normalization, correlation, and the differences between descriptive and predictive analytics. Additionally, it includes a case study on predicting house prices and a logistic regression model for fraud detection, along with calculations and recommendations for improving model accuracy.

Uploaded by

bharat.goel

1) What does a box plot primarily represent?

A) Mean and standard deviation

B) Median, quartiles, and outliers

C) Mode and range

D) Skewness and kurtosis

2) What information does a histogram provide about a dataset?

a) The frequency distribution of continuous or discrete data

b) The correlation between two variables

c) The causal relationships between variables


d) The central tendency and spread of categorical data

3) What is the purpose of normalization in data preprocessing?

a) To reduce the data's dimensionality

b) To transform data into a common scale without distorting differences in the range of values

c) To handle missing values

d) To detect outliers

4) In a scatter plot, what does a strong positive correlation between two variables look like?

a) A cloud of points scattered randomly

b) A horizontal line of points

c) A vertical line of points


d) Points forming an upward-sloping line

5) What is a key difference between descriptive analytics and predictive analytics?

a) Descriptive analytics forecasts future events, while predictive analytics summarizes historical data.

b) Descriptive analytics explains what happened in the past, while predictive analytics forecasts what is
likely to happen in the future.

c) Descriptive analytics provides recommendations, while predictive analytics visualizes data.

d) Descriptive analytics optimizes business processes, while predictive analytics identifies patterns.

6) Which technique is commonly used in prescriptive analytics to provide actionable recommendations?

a) Regression Analysis
b) Simulation Models
c) Clustering
d) Data Summarization

7) How does Business Intelligence (BI) generally differ from Business Analytics?

a) BI focuses on historical reporting, while Business Analytics involves predictive and prescriptive
modeling.

b) BI involves data visualization, while Business Analytics includes real-time data analysis.

c) BI and Business Analytics are used interchangeably in all contexts.

d) BI focuses on forecasting, while Business Analytics is concerned with descriptive statistics.

8) In business analytics, which type of problem is best addressed by classification techniques?

a) Predicting continuous numerical values

b) Determining the impact of independent variables on a dependent variable

c) Assigning data points to predefined categories or classes

d) Estimating future trends based on historical patterns

Case Study Questions 9 - 16

Dependent Variable: House Prices (in $)

R-squared: 0.820
Adjusted R-squared: 0.815

Variable              Coefficient   Std. Error   t-Statistic   P-value
Intercept                   50000        10000          5.00     0.000
Number of Bedrooms          30000         5000          6.00     0.000
Square Footage                150           25          6.00     0.000
Distance to City           -20000         1000        -20.00     0.000

F-statistic: 150.00
Prob (F-statistic): 0.000
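
As a quick consistency check on the regression output above, each t-statistic should equal the coefficient divided by its standard error. The sketch below recomputes them; the dictionary `rows` and the helper `t_statistic` are names introduced here for illustration only.

```python
# Values copied from the regression table: (coefficient, std. error, reported t-statistic).
rows = {
    "Intercept": (50000, 10000, 5.00),
    "Number of Bedrooms": (30000, 5000, 6.00),
    "Square Footage": (150, 25, 6.00),
    "Distance to City": (-20000, 1000, -20.00),
}


def t_statistic(coefficient, std_error):
    """t-statistic for the null hypothesis that the coefficient is zero."""
    return coefficient / std_error


for name, (coefficient, std_error, reported) in rows.items():
    t = t_statistic(coefficient, std_error)
    print(f"{name}: t = {t:.2f} (reported {reported:.2f})")
```

All four recomputed values match the reported column, so the table is internally consistent.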

9) The coefficient for the variable "Distance to City" is -20000. What does this coefficient imply?
A) For every additional mile away from the city, the house price decreases by $20,000.
B) For every additional mile closer to the city, the house price decreases by $20,000.
C) For every additional mile away from the city, the house price increases by $20,000.
D) For every additional mile away from the city, the house price stays the same.

10) The R-squared value is 0.820. What does this tell you about the model?

A) The model explains 18% of the variance in house prices.


B) The model explains 82% of the variance in house prices.
C) The model fits the data perfectly.
D) The model does not explain any of the variance in house prices.

11) What does a p-value of 0.000 for all independent variables indicate?

A) All independent variables have no effect on house prices.


B) All independent variables are statistically insignificant.
C) All independent variables are statistically significant at any conventional level (e.g., 1%, 5%).
D) The model does not explain any variance in house prices.

12) Which variable has a larger impact on house prices according to the regression output?

A) Number of Bedrooms
B) Square Footage
C) Distance to City
D) Intercept

13) The adjusted R-squared value is 0.815, slightly lower than the R-squared of 0.820. Why is this?

A) Adjusted R-squared is always higher than R-squared.


B) Adjusted R-squared accounts for the number of predictors and penalizes adding irrelevant variables.
C) Adjusted R-squared only applies to time series models.
D) Adjusted R-squared is not relevant in this case.

14) The model's intercept is 50000. What does this value represent?

A) The baseline house price with zero bedrooms, zero square footage, and zero distance from the city
center.
B) The baseline house price when all variables are at their mean value.
C) The maximum possible house price in the model.
D) It has no practical significance.

15) If a house has 4 bedrooms, 3000 square feet, and is located 5 miles from the city center, what would
the predicted price be?

16) The model’s F-statistic is 150. What does a high F-statistic suggest about the model?

A) The model explains all of the variance in house prices.


B) The overall regression model is statistically significant, and at least one predictor has a non-zero
coefficient.
C) The independent variables are not related to the dependent variable.
D) The model does not fit the data well.

17) A retail company wants to forecast monthly sales based on advertising spend, time of the year, and
number of competitors in the market. The company fits a linear regression model and finds that although
the model performs well on historical data, the forecasted sales for the upcoming months are consistently
underestimated.
Question:
What could be the possible reasons for this underestimation? Suggest a strategy to address this issue and
improve the accuracy of the forecasts.

18) A logistic regression model predicts a probability of 0.62 for an event. If the threshold for
classification is 0.5, what is the predicted class for this observation?

A) Class 0
B) Class 1
C) The model is inconclusive
D) Need more data to classify

19) A logistic regression model predicts 80 instances, out of which 60 were classified correctly. What
is the model's accuracy?
A) 0.60 B) 0.65 C) 0.75 D) 0.80

20) If a logistic regression model has a true positive rate of 0.85 and a false positive rate of 0.10, what is
the value of the area under the ROC curve (AUC)?
A) 0.10
B) 0.50
C) 0.85
D) 0.95

21) An online payment platform is using logistic regression to predict whether a transaction is
fraudulent. The dataset includes features such as transaction amount, transaction time, location, and
device type. The model outputs the probability of fraud for each transaction. (10 marks)

Questions:

1. For a particular transaction, the model predicts a probability of 0.85 for fraud. What decision
would the platform make if it sets the classification threshold at 0.7? Would you recommend
adjusting the threshold? Why or why not?
Ans :

Decision: Since the predicted probability of fraud (0.85) exceeds the classification threshold of
0.7, the platform would classify this transaction as fraudulent and flag it for further investigation.

Threshold Adjustment Recommendation: Whether to adjust the threshold depends on the business
goals:

If the platform wants to minimize false positives (legitimate transactions incorrectly flagged), it
should raise the threshold so that only higher-probability transactions are flagged.

If the platform wants to catch more fraud (reduce false negatives), it should lower the threshold to
flag more transactions as fraudulent.

Recommendation: If catching fraud is the priority (which is common in fraud detection), avoid
raising the threshold. If instead the platform is flagging too many legitimate transactions,
raising the threshold would help reduce the false positives.
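
The decision rule used here (and in question 18) can be sketched in a few lines. This is a minimal illustration, not the platform's actual logic: the function name `classify` is hypothetical, and treating a probability exactly equal to the threshold as positive is an assumption.

```python
def classify(probability, threshold):
    """Return 1 (positive class, e.g. fraud) if probability >= threshold, else 0.

    Assumption: a probability exactly equal to the threshold counts as positive.
    """
    return 1 if probability >= threshold else 0


print(classify(0.85, 0.7))  # 1: flagged as fraudulent
print(classify(0.62, 0.5))  # 1: Class 1, as in question 18
print(classify(0.62, 0.7))  # 0: the same score is not flagged under a stricter threshold
```

The last call shows why the threshold is a business choice: the same model output leads to a different decision once the threshold moves.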
2. Explain the consequences of false positives (a legitimate transaction flagged as fraudulent) and
false negatives (a fraudulent transaction allowed to proceed). In this context, which is more
critical to minimize?

Ans :

False Positives (FP): A legitimate transaction is incorrectly flagged as fraud. The consequences are:

Customer frustration due to transaction delays or cancellations.

Possible loss of business or customer trust.

Operational costs associated with investigating false alerts.

False Negatives (FN): A fraudulent transaction is allowed to proceed. The consequences are:

Financial loss to the company due to undetected fraud.

Reputational damage, especially if the fraud is discovered later.

Increased fraud risk, as fraudsters may continue exploiting weaknesses in the system.

More Critical to Minimize: In this case, false negatives (fraudulent transactions allowed to
proceed) are more critical to minimize. Fraud results in direct financial loss and potential legal and
reputational risks, making it the higher-priority issue.

3. Precision-Recall Trade-off:
After evaluating the model, you find that the precision is 0.80 and the recall is 0.60. If the
company wants to minimize the number of legitimate transactions incorrectly flagged as fraud,
how would you adjust the model’s threshold?

Ans :

Precision (0.80): Out of all flagged fraudulent transactions, 80% were actually fraudulent.

Recall (0.60): Out of all actual fraud cases, only 60% were detected by the model.

Adjusting the Threshold:

If the company wants to minimize the number of legitimate transactions incorrectly flagged as
fraud, it needs to increase precision. This can be achieved by raising the threshold for flagging
fraud. By increasing the threshold, the model will become more conservative and flag fewer
transactions, thereby reducing false positives but potentially missing more actual fraud cases
(lowering recall).

Recommendation: Adjust the threshold upward if the cost of incorrectly flagging legitimate
transactions is higher than missing a few fraud cases.
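
The trade-off can be illustrated on a small made-up example. The scores and labels below are hypothetical (they are not the quiz's data), chosen only so that raising the threshold visibly raises precision and lowers recall; on real data precision usually, but not always, rises with the threshold.

```python
# Hypothetical model scores and true labels (1 = fraud, 0 = legitimate).
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.95, 0.85, 0.75, 0.55, 0.35, 0.80, 0.60, 0.40, 0.20, 0.10]


def precision_recall(scores, labels, threshold):
    """Precision and recall when every score >= threshold is flagged as fraud."""
    flagged = [s >= threshold for s in scores]
    tp = sum(1 for f, y in zip(flagged, labels) if f and y == 1)
    fp = sum(1 for f, y in zip(flagged, labels) if f and y == 0)
    fn = sum(1 for f, y in zip(flagged, labels) if not f and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return precision, recall


for threshold in (0.5, 0.7, 0.9):
    p, r = precision_recall(scores, labels, threshold)
    print(f"threshold={threshold:.1f} precision={p:.3f} recall={r:.3f}")
# threshold=0.5 precision=0.667 recall=0.800
# threshold=0.7 precision=0.750 recall=0.600
# threshold=0.9 precision=1.000 recall=0.200
```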

4. Confusion Matrix Analysis:


The confusion matrix for the model is as follows:
True Positives: 250, True Negatives: 500, False Positives: 150, False Negatives : 100

Calculate the model’s accuracy and F1 score. What do these metrics reveal about the model's fraud
detection capability?
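Ans :

Accuracy = (TP + TN) / (TP + TN + FP + FN) = (250 + 500) / 1000 = 0.75

Precision = TP / (TP + FP) = 250 / 400 = 0.625

Recall = TP / (TP + FN) = 250 / 350 ≈ 0.714

F1 Score = 2 × Precision × Recall / (Precision + Recall) = 2 × 0.625 × 0.714 / (0.625 + 0.714) ≈ 0.667

Interpretation:

Accuracy (0.75): The model correctly classifies 75% of the transactions overall. However, accuracy alone
may not fully reflect the model's performance in an imbalanced fraud detection context, where false
negatives can be very costly.

F1 Score (0.667): The F1 score is lower than the accuracy because both precision (0.625) and recall
(0.714) are modest: 150 of the 400 flagged transactions were legitimate, and 100 of the 350 actual
fraud cases were missed. The model's fraud detection (TP vs FP and FN) could therefore be improved.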
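
The metrics for this confusion matrix can be checked with a short script. This is a minimal sketch using the standard definitions; the function name `confusion_metrics` is introduced here for illustration.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)   # share of flagged transactions that were fraud
    recall = tp / (tp + fn)      # share of actual fraud that was flagged
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


accuracy, precision, recall, f1 = confusion_metrics(tp=250, tn=500, fp=150, fn=100)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# accuracy=0.750 precision=0.625 recall=0.714 f1=0.667
```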

5. Business Decision:
Fraudulent transactions cost the company $1000 per case, while incorrectly flagging a legitimate
transaction costs $50. Based on the confusion matrix, calculate the total cost of the current
model’s predictions. How would you adjust the model to reduce overall costs?

Ans : Total cost for false positives = 150 × 50 = $7,500

Total cost for false negatives = 100 × 1,000 = $100,000

Total Cost: $7,500 + $100,000 = $107,500
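
The same arithmetic as a small sketch, which also makes the cost ratio explicit. The function name `total_cost` and its parameter names are hypothetical; the counts and per-case costs are the ones given in the question.

```python
def total_cost(fp, fn, cost_fp=50, cost_fn=1000):
    """Total cost of errors: false positives at cost_fp each, false negatives at cost_fn each."""
    return fp * cost_fp + fn * cost_fn


print(total_cost(fp=150, fn=100))  # 107500

# Each missed fraud case costs as much as this many false alarms.
break_even = 1000 / 50
print(break_even)  # 20.0
```

Under these costs, a threshold change that catches one additional fraud case pays off as long as it adds fewer than 20 new false positives.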

Adjusting the Model:


Since the cost of false negatives is much higher than false positives, the model should prioritize
minimizing false negatives (i.e., detecting more fraud). To do this, lowering the threshold for
classifying fraud could help capture more fraudulent transactions, even if it slightly increases
the number of false positives.

The goal is to balance the trade-off between false positives and false negatives in a way that
minimizes the overall financial cost. Lowering the threshold would likely lead to a higher recall,
which would reduce the more expensive false negatives.

MCQ Answers:

1-B, 2-A, 3-B, 4-D, 5-B, 6-B, 7-A, 8-C, 9-A, 10-B, 11-C, 12-A, 13-B, 14-A

15) Predicted Price = 50000 + (30000 × 4) + (150 × 3000) + (-20000 × 5)

= 50,000 + 120,000 + 450,000 - 100,000 = $520,000
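
The prediction can be reproduced directly from the fitted coefficients. This is a minimal sketch: the function and key names are hypothetical, and distance is taken in miles as in question 9.

```python
# Coefficients from the regression table in the case study.
COEFFICIENTS = {
    "intercept": 50000,
    "bedrooms": 30000,
    "square_feet": 150,
    "distance_miles": -20000,
}


def predict_price(bedrooms, square_feet, distance_miles):
    """Predicted house price in dollars from the linear regression equation."""
    return (
        COEFFICIENTS["intercept"]
        + COEFFICIENTS["bedrooms"] * bedrooms
        + COEFFICIENTS["square_feet"] * square_feet
        + COEFFICIENTS["distance_miles"] * distance_miles
    )


print(predict_price(bedrooms=4, square_feet=3000, distance_miles=5))  # 520000
```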

16-B, 17 -, 18-B, 19-C (60/80 = 0.75), 20-C

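20) C - A single operating point does not determine the AUC exactly. Joining (0, 0), the point
(FPR, TPR) = (0.10, 0.85), and (1, 1) with straight lines gives AUC ≈ (1 + TPR - FPR) / 2 = 0.875,
and 0.85 is the closest of the listed options.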
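
A quick check of the single-point estimate for question 20, assuming the ROC curve is approximated by straight lines through (0, 0), (FPR, TPR), and (1, 1). That piecewise-linear assumption is made here for illustration; a real AUC needs the full curve. The function name `single_point_auc` is hypothetical.

```python
def single_point_auc(tpr, fpr):
    """Area under the piecewise-linear ROC through (0, 0), (fpr, tpr), (1, 1).

    Triangle on [0, fpr] plus trapezoid on [fpr, 1], which simplifies to
    (1 + tpr - fpr) / 2.
    """
    return (1 + tpr - fpr) / 2


print(round(single_point_auc(tpr=0.85, fpr=0.10), 3))  # 0.875
```

Of the listed options, 0.85 is the nearest to this estimate.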
