W3 (Extra) - Data 123 Practice Open Questions With Means

Uploaded by

z13612909240
Copyright: © All Rights Reserved

Question 1:

"The management of a company is interested in evaluating the impact of a recent training program
on the performance of its employees. To assess this, they collected data on the pre-training and post-
training scores of a random sample of employees. The company believes that, on average, the
training program should result in a positive improvement in the scores.

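- A: Which statistical test would be appropriate to investigate the company's belief about the
training program? One-sample t-test on the post-minus-pre differences (equivalent to a paired t-test)
- B: Explain in a few lines why you chose this test.
o Because the same employees are measured twice, the two sets of scores are paired. We therefore
work with 1 sample (the differences, post minus pre) and test whether its mean is 0.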
- C: Show (copy paste) the commands or steps you would use to perform the statistical
analysis (even if you were unable to execute the analysis). Also upload the output

# One-sample t-test
# Step 1: Compute the differences between post and pre
data123$differences <- data123$Post_training - data123$Pre_training

# Step 2: Compute the test
t.test(data123$differences)

Output:
One Sample t-test

data: data123$differences
t = 6.0243, df = 49, p-value = 2.147e-07
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
4.112664 8.229868
sample estimates:
mean of x
6.171266
- D: Based on your analysis, provide a conclusion regarding whether there was a significant
change in performance."

The p-value (2.147e-07) is < 0.05, so we reject H0 (mean difference = 0). The mean difference is
positive (6.17), so there is a significant improvement in performance after the training program.

- E: Use a 95% confidence interval to
o 1st: answer the question: Since 0 is outside the confidence interval, we reject H0 and
conclude that there is a significant improvement in performance after the training
program.
o 2nd: explain what this interval means: We are 95% confident that the true mean improvement
(post minus pre) lies between 4.1 and 8.2.
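
The answer above can be cross-checked with a short sketch. It assumes the scores are simulated exactly as in the "#Codes" section of this document (same seed and the same first two rnorm calls); the object names two_sided, paired, one_sided and manual_ci are introduced here only for illustration. It shows that the one-sample test on the differences is the same as a paired t-test, what the one-sided version of the test looks like (the company expects an improvement, not just a change), and how the 95% confidence interval can be rebuilt by hand.

```r
# Recreate the pre/post scores the same way the "#Codes" section does.
set.seed(123)
n_employees   <- 50
Pre_training  <- rnorm(n_employees, mean = 70, sd = 10)
Post_training <- Pre_training + rnorm(n_employees, mean = 5, sd = 8)
differences   <- Post_training - Pre_training

# Two-sided one-sample t-test on the differences (the test used in the answer).
two_sided <- t.test(differences)

# The same test written as a paired t-test: identical t, df and p-value.
paired <- t.test(Post_training, Pre_training, paired = TRUE)

# One-sided version matching the belief "scores improve": H1 is mean difference > 0.
one_sided <- t.test(differences, alternative = "greater")

# 95% confidence interval rebuilt by hand: mean +/- t-quantile * standard error.
n         <- length(differences)
std_error <- sd(differences) / sqrt(n)
manual_ci <- mean(differences) + c(-1, 1) * qt(0.975, df = n - 1) * std_error

print(two_sided)
print(one_sided$p.value)
print(manual_ci)
```

Because the t-statistic is positive, the one-sided p-value is half of the two-sided one, so the conclusion does not change.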
Question 2:

Imagine you are tasked with analyzing the impact of employee experience on the performance
before the training. In your dataset (training_data; the code below calls it data123) the variable
"Experience" represents the years of experience that each employee has and it is used to split the
employees into less experienced (Group A) and experienced (Group B).

Your objective is to determine whether there is a significant difference in pre-training performance
between employees with less than 5 years of experience (Group A) and those with 5 or more years
of experience (Group B).

- A: Which statistical test would be appropriate? We choose a two-sample t-test
- B: Explain in a few lines why you chose this test. Because we have 2 independent samples
(Group A and Group B) whose variances are similar
- C: Show (copy paste) the commands or steps you would use to perform the statistical
analysis (even if you were unable to execute the analysis). Also upload the output

var(data123$Pre_training[data123$Group == "Group A"])

#result for group A = 77.51

var(data123$Pre_training[data123$Group == "Group B"])

#result for group B = 90.27

90.27/77.51

# The ratio between them is about 1.16 < 2, so we can assume equal variances

t.test(Pre_training ~ Group, data = data123, var.equal = TRUE)


Output:
Two Sample t-test

data: Pre_training by Group
t = 0.61765, df = 48, p-value = 0.5397
alternative hypothesis: true difference in means between group Group A and
group Group B is not equal to 0
95 percent confidence interval:
-4.088114 7.713497
sample estimates:
mean in group Group A mean in group Group B
71.64917 69.83648

- D: Based on your analysis, provide a conclusion regarding whether there is a significant
difference in pre-training performance between the two groups.

The p-value (0.5397) is > 0.05, so we cannot reject H0. We don't have enough evidence to say that
there is a significant difference between the 2 groups in performance before training.

- E: Use a 95% confidence interval to answer the question and explain what this interval
means.

0 lies inside the confidence interval, so we cannot reject H0: the data are compatible with no
difference in average pre-training performance between the 2 groups. The confidence interval tells
us we are 95% confident that the difference in mean pre-training score (Group A minus Group B) is
between -4.09 and 7.71.
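
As a rough check of the steps in this question, the sketch below computes the variance ratio directly instead of typing the two variances by hand, compares the pooled test with Welch's test (the default of t.test, which does not assume equal variances), and rebuilds the pooled 95% confidence interval manually. It assumes the data are simulated as in the "#Codes" section; because sample() can give different draws across R versions, the numbers may not match the output printed above. The names group_vars, var_ratio, pooled, welch and manual_ci are illustrative only.

```r
# Recreate the relevant columns the same way the "#Codes" section does.
set.seed(123)
n_employees   <- 50
Pre_training  <- rnorm(n_employees, mean = 70, sd = 10)
Post_training <- Pre_training + rnorm(n_employees, mean = 5, sd = 8)
Age           <- sample(25:45, n_employees, replace = TRUE)
Experience    <- sample(1:10, n_employees, replace = TRUE)
data123       <- data.frame(Pre_training, Post_training, Age, Experience)
data123$Group <- ifelse(data123$Experience < 5, "Group A", "Group B")

# Per-group variance, mean and size, then the rule-of-thumb ratio (largest / smallest).
group_vars  <- tapply(data123$Pre_training, data123$Group, var)
group_means <- tapply(data123$Pre_training, data123$Group, mean)
group_n     <- tapply(data123$Pre_training, data123$Group, length)
var_ratio   <- max(group_vars) / min(group_vars)

# Pooled (equal-variance) test used in the answer, and Welch's test for comparison.
pooled <- t.test(Pre_training ~ Group, data = data123, var.equal = TRUE)
welch  <- t.test(Pre_training ~ Group, data = data123)

# Pooled 95% confidence interval rebuilt by hand.
df_pooled  <- sum(group_n) - 2
pooled_var <- sum((group_n - 1) * group_vars) / df_pooled
std_error  <- sqrt(pooled_var * sum(1 / group_n))
mean_diff  <- group_means[["Group A"]] - group_means[["Group B"]]
manual_ci  <- mean_diff + c(-1, 1) * qt(0.975, df_pooled) * std_error

print(var_ratio)
print(pooled$conf.int)
print(welch$conf.int)
print(manual_ci)
```

When the variance ratio is close to 1, the pooled and Welch intervals are usually very similar, so the choice of var.equal matters little for the conclusion.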
Question 3:

"A multinational company is curious about the motivational levels of its employees from different
nationalities. The company collected data on motivation scores, categorizing employees into three
nationalities: Spanish, Dutch, and German. The company hypothesizes that there might be
differences in motivation levels among these nationalities.

- A: Which statistical test would you use to explore whether there are significant differences in
motivation levels among the three nationalities? F-test from a one-way ANOVA
- B: Explain in a few lines why you selected this test. We chose this because we compare the mean
of a numeric variable (motivation) across more than 2 groups, so we are interested in the overall
effect of nationality (all groups at once) on motivation.
- C: Show (copy paste) the commands or steps you would use to perform the statistical
analysis (even if you were unable to execute the analysis). Also upload the output

model_original = lm(Motivation ~ Nationality, data = data123)
summary(model_original)

- D: Based on your analysis, provide a conclusion regarding whether there is a significant
difference in motivation between the nationalities."
Call:
lm(formula = Motivation ~ Nationality, data = data123)

Residuals:
Min 1Q Median 3Q Max
-3.2604 -0.7765 -0.0648 1.1413 2.2651

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.5845 0.2844 9.086 6.42e-12 ***
NationalityGerman 1.0661 0.4345 2.454 0.0179 *
NationalitySpanish 3.4010 0.4345 7.827 4.62e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.272 on 47 degrees of freedom
Multiple R-squared: 0.5698, Adjusted R-squared: 0.5515
F-statistic: 31.13 on 2 and 47 DF, p-value: 2.456e-09

For Elke:

- 1st df = number of groups - 1 = 3 - 1 = 2
- 2nd df = total people - total groups = 50 - 3 = 47

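The p-value of the F-test (2.456e-09) is < 0.05, so we reject H0 that all three mean motivation
levels are equal: at least one nationality differs from the others in average motivation.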
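
The degrees of freedom and the F-test can be traced with the following sketch. It assumes the data are simulated as in the "#Codes" section (results from sample() may differ across R versions, so the numbers need not match the output above); k, n, df_between, df_within, f_manual and p_manual are names introduced here for illustration. anova() prints the same F-test as the last line of summary(), and the F-statistic is rebuilt from the sums of squares.

```r
# Recreate the motivation data the same way the "#Codes" section does.
set.seed(123)
n_employees   <- 50
Pre_training  <- rnorm(n_employees, mean = 70, sd = 10)
Post_training <- Pre_training + rnorm(n_employees, mean = 5, sd = 8)
Age           <- sample(25:45, n_employees, replace = TRUE)
Experience    <- sample(1:10, n_employees, replace = TRUE)
Motivation    <- rnorm(n_employees, mean = 3, sd = 1)
data123 <- data.frame(Pre_training, Post_training, Age, Experience, Motivation)
data123$Nationality <- sample(c("Spanish", "Dutch", "German"), nrow(data123), replace = TRUE)
data123$Motivation <- with(data123,
  ifelse(Nationality == "Spanish", rnorm(n_employees, mean = 6, sd = 1.53),
    ifelse(Nationality == "Dutch", rnorm(n_employees, mean = 3, sd = 1.12),
      rnorm(n_employees, mean = 3.5, sd = 0.97))))

model_original <- lm(Motivation ~ Nationality, data = data123)
anova_table    <- anova(model_original)   # one-way ANOVA table for the same model

# Degrees of freedom: (number of groups - 1) and (total people - number of groups).
k          <- length(unique(data123$Nationality))
n          <- nrow(data123)
df_between <- k - 1
df_within  <- n - k

# F = (between-group mean square) / (within-group mean square), then its p-value.
f_manual <- (anova_table[["Sum Sq"]][1] / df_between) /
            (anova_table[["Sum Sq"]][2] / df_within)
p_manual <- pf(f_manual, df_between, df_within, lower.tail = FALSE)

print(anova_table)
print(c(df_between, df_within, f_manual, p_manual))
```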

Extra question: Imagine I ask you whether Spanish are more motivated than Dutch?

We can see that when comparing Spanish to the reference (Dutch), the p-value (4.62e-10) is < 0.05,
which means that there is a significant difference in motivation between these 2 groups. The
coefficient is positive: on average, Spanish score 3.4 higher in motivation than Dutch.
Question 3.5

Some people claim that Spanish people are more motivated than the rest of the nationalities, can
you answer this question with your current output ? If not do something about it and answer the
following questions:

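- C: Which statistical test would you use to answer the question?
- D: Explain in a few lines why you selected this test.
- C+D answer: Not completely. With Dutch as the reference, the current output compares Spanish
with Dutch but not Spanish with German. Since we want to compare Spanish to each of the other
nationalities, we refit the lm model with Spanish as the reference and look at the t-tests of the
coefficients.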
- E: Show (copy paste) the commands or steps you would use to answer the new question 3.5
Also upload the output

data123$German_dummy = ifelse(data123$Nationality == "German", 1, 0)
data123$Dutch_dummy = ifelse(data123$Nationality == "Dutch", 1, 0)
data123$Spanish_dummy = ifelse(data123$Nationality == "Spanish", 1, 0)

model_spanish_ref <- lm(Motivation ~ Dutch_dummy + German_dummy, data = data123)
summary(model_spanish_ref)

- F: Based on your analysis, provide a conclusion regarding the claim.

Call:
lm(formula = Motivation ~ Dutch_dummy + German_dummy, data = data123)

Residuals:
Min 1Q Median 3Q Max
-3.2604 -0.7765 -0.0648 1.1413 2.2651

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.9855 0.3284 18.224 < 2e-16 ***
Dutch_dummy -3.4010 0.4345 -7.827 4.62e-10 ***
German_dummy -2.3349 0.4645 -5.027 7.68e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.272 on 47 degrees of freedom
Multiple R-squared: 0.5698, Adjusted R-squared: 0.5515
F-statistic: 31.13 on 2 and 47 DF, p-value: 2.456e-09

So the claim is supported by the data: Spanish employees have the highest average motivation. Both
coefficients are negative and significant (p-values < 0.05): on average Germans scored 2.3 lower in
motivation than Spanish, and Dutch scored 3.4 lower in motivation than Spanish.
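
As an alternative to building the dummy variables by hand, the reference category can also be changed with relevel() on a factor. The sketch below uses a small made-up data frame (toy, with invented motivation scores, not the data of this exercise) to show that both approaches give the same coefficients; the column name Nationality_ref is hypothetical.

```r
# Hypothetical toy data, only for illustrating the two ways of setting the reference.
toy <- data.frame(
  Nationality = rep(c("Dutch", "German", "Spanish"), each = 3),
  Motivation  = c(2.5, 3.0, 2.8, 3.4, 3.9, 3.6, 5.8, 6.3, 6.1)
)

# Approach used in the answer: dummy variables, leaving out the Spanish dummy.
toy$Dutch_dummy  <- ifelse(toy$Nationality == "Dutch", 1, 0)
toy$German_dummy <- ifelse(toy$Nationality == "German", 1, 0)
model_dummy <- lm(Motivation ~ Dutch_dummy + German_dummy, data = toy)

# Alternative: make Spanish the first (reference) level of the factor.
toy$Nationality_ref <- relevel(factor(toy$Nationality), ref = "Spanish")
model_relevel <- lm(Motivation ~ Nationality_ref, data = toy)

# Same intercept (Spanish mean) and same differences from Spanish in both models.
print(coef(model_dummy))
print(coef(model_relevel))
```

The relevel() route avoids creating extra columns, while the dummy-variable route makes the coding of each category explicit; either can be used to answer the question.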
#Codes:

# Data simulation

set.seed(123)

n_employees <- 50

Employee_ID <- 1:n_employees

Pre_training <- rnorm(n_employees, mean = 70, sd = 10)

Post_training <- Pre_training + rnorm(n_employees, mean = 5, sd = 8)

Age <- sample(25:45, n_employees, replace = TRUE)

Experience <- sample(1:10, n_employees, replace = TRUE)

Motivation <- rnorm(n_employees, mean = 3, sd = 1)

data123 <- data.frame(Employee_ID, Pre_training, Post_training, Age, Experience, Motivation)

data123$Group <- ifelse(data123$Experience < 5, "Group A", "Group B")

data123$Nationality <- sample(c("Spanish", "Dutch", "German"), nrow(data123), replace = TRUE)

data123$Motivation <- with(data123, ifelse(Nationality == "Spanish", rnorm(n_employees, mean = 6, sd = 1.53),
  ifelse(Nationality == "Dutch", rnorm(n_employees, mean = 3, sd = 1.12),
    rnorm(n_employees, mean = 3.5, sd = 0.97))))

library(tidyverse)

##Question 1: One sample t-test

#Step 1: Create your new sample

difference <- data123$Post_training- data123$Pre_training

#Step 2: Do the t-test

t.test(difference)

#Other method, but same result


data123$difference2.0 <- data123$Post_training - data123$Pre_training

t.test(data123$difference2.0)

#Interpretation:

# P-Value is < 0.05, so we reject H0 (µ(diff) = 0); the mean difference is positive, so we can say that there is a significant improvement.

# Using the confidence interval, we see that 0 is outside it, so we reject H0.

##Question 2: Two sample t-test

# We have 2 samples: We investigate the difference in pretraining scores between group a and b

# The samples are independent

# We need to see whether there is equal variance or not, HOW ?

# we compute the variances and calculate the rule of thumb

var(data123$Pre_training[data123$Group == "Group A"])

var(data123$Pre_training[data123$Group == "Group B"])

# rule of thumb : var(biggest)/var(smallest)

90.27/77.51

# Since the result 1.16 is < 2 we can assume equal variance, so in the following code we write var.equal = TRUE

t.test(Pre_training ~ Group, data=data123, var.equal = TRUE)

##Question 3: More than 2 samples

# 3 samples:

model_original <- lm(Motivation ~ Nationality, data = data123)

summary(model_original)

# If question is about general differences between all groups -> Look at the F-test

# the implied null hypothesis H0 is that all means are equal: µ(german) = µ(dutch) = µ(spanish)

# In our example: F = 31.13 and the corresponding P-Value = 2.456e-09 < 0.05, so we reject H0

# Conclusion: the average motivation levels are not the same for all nationalities

# If question is about two specific nationalities, check who is the reference category

# (it is the one that is missing in the output)

# Originally (model_original) the missing category is dutch, so it is the REFERENCE.

# With this output (model_original) you can interpret the differences between german and dutch or
# spanish and dutch

# However, we can't say much about the differences between german and spanish, so we need to
# change the reference

#Step 1: Make dummy variables

data123$German_dummy = ifelse(data123$Nationality == "German", 1, 0)

data123$Dutch_dummy = ifelse(data123$Nationality == "Dutch", 1, 0)

data123$Spanish_dummy = ifelse(data123$Nationality == "Spanish", 1, 0)

#Step 2: Re-create the model, by excluding one of the dummies, it automatically makes that variable
# the reference

#Here we want spanish as reference, so we don't include it in the model

model_spanish_ref<- lm(Motivation ~ Dutch_dummy + German_dummy, data = data123)

summary(model_spanish_ref)
