Practice test 1.
Question 1:
A researcher is conducting a study to investigate the impact of three different teaching methods (A,
B, C) on students' exam scores. What is the measurement level of the teaching method variable?
a. Interval
b. Nominal
c. Ordinal
d. Ratio
Question 2:
Given the following table, calculate the chi square statistic:
x/y   Group 1   Group 2
A     40        25
B     25        40
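Worked check (a sketch, not an official answer key): the statistic can be verified in R with chisq.test. Turning off the continuity correction (correct = FALSE) is an assumption made here so that the result matches the usual hand calculation of sum((O - E)^2 / E).

```r
# Observed counts from the table above (rows A/B, columns Group 1/Group 2)
tab <- matrix(c(40, 25,
                25, 40),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("A", "B"), c("Group 1", "Group 2")))

# Pearson chi-square without Yates' continuity correction
res <- chisq.test(tab, correct = FALSE)

res$expected   # expected counts under independence
res$statistic  # chi-square statistic
```

With equal row and column totals, every expected count is 32.5, so each cell contributes 7.5^2 / 32.5 to the statistic.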
Question 3:
a- The value of the intercept is:
b- The value of the slope is:
c- The sign of the relationship is:
d- If the x value is 2, the predicted y value is:
Question 4:
A group of scientists collected data on the heights of a random sample of 120 individuals. The mean
height is 65 inches, and the standard deviation is 5 inches. Calculate the 95% confidence interval for
the population mean height.
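Worked check (a sketch under assumptions): the interval is mean ± critical value × standard error. The code below uses the normal critical value, which is an assumption; with n = 120 the t-based value (df = 119) gives a slightly wider interval, also shown.

```r
n <- 120
x_bar <- 65
s <- 5

se <- s / sqrt(n)  # standard error of the mean

# 95% interval using the normal critical value (about 1.96)
ci_z <- x_bar + c(-1, 1) * qnorm(0.975) * se

# 95% interval using the t critical value with n - 1 degrees of freedom
ci_t <- x_bar + c(-1, 1) * qt(0.975, df = n - 1) * se

ci_z
ci_t
```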
Question 5:
A company wants to estimate the average satisfaction score of its customers. They are willing to
accept a margin of error of 2. How large should the sample be? (Rounding errors will be accepted).
Question 6:
We want to assess the relationship between the severity of a health condition (categorized as "Mild,"
"Moderate," "Severe.") and the effectiveness of different treatment methods categorized as "Low,"
"Moderate," "High." Which statistical measure is appropriate for assessing the relationship between
the two variables?
a. Kendall's tau-c
b. Pearson's correlation
c. Cramer's V
d. Kendall's tau-b
Question 7:
- Reference category
- Group 1
a- The value of the intercept (reference category) is:
b- The value of the statistic associated with group 1 is:
c- The value of the slope is:
Question 8:
In a survey of 550 students, 48% of them reported that they use social media regularly. Calculate the
95% confidence interval for the proportion of people in the population who use social media
regularly.
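Worked check (a sketch, assuming the normal-approximation interval is the one expected here): the standard error of a proportion is sqrt(p(1 - p) / n).

```r
n <- 550
p_hat <- 0.48

se <- sqrt(p_hat * (1 - p_hat) / n)         # standard error of the proportion
ci <- p_hat + c(-1, 1) * qnorm(0.975) * se  # 95% normal-approximation interval

round(ci, 3)
```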
Question 9:
A chi-square test was conducted to examine the association between smartphone ownership (yes or
no) and the preference for a specific mobile app (App A, App B, No Preference). The chi-square
statistic is 9.45 with 2 degrees of freedom. What is the p-value associated with this test?
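Worked check (a sketch): the p-value is the upper-tail area of the chi-square distribution beyond the observed statistic, which pchisq returns when lower.tail = FALSE.

```r
chi_sq <- 9.45
df <- 2

# Upper-tail probability P(X >= 9.45) for a chi-square variable with 2 df
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)
p_value
```

The same call with the statistic and degrees of freedom swapped in applies to any other chi-square p-value question in this test.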
Question 10:
A researcher wants to analyze the association between income and the preference for a specific
brand of a product (Brand A, Brand B, or No Preference). What measure would be suitable for this
analysis?
a. Pearson’s or Spearman's correlation
b. Kendall's tau-c
c. Cramer's V
d. Kendall's tau-b
Question 11:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.5302 0.3124 -4.896 < 0.001 ***
x1 0.8215 0.0427 19.256 < 2e-16 ***
x2 -1.2763 0.0401 -31.822 < 2e-16 ***
x3 0.0157 0.0392 0.401 0.689
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.12 on 236 degrees of freedom
Multiple R-squared: 0.9458, Adjusted R-squared: 0.9451
F-statistic: 1395 on 3 and 236 DF, p-value: < 2.2e-16
a- Create a 95% confidence interval for the variable x1
b- Create a 95% confidence interval for the variable x3
c- Which of the following statements are correct?
a. A high (and significant) F value means that the model estimated here is correct.
b. In this model, all variables are associated with the dependent variable.
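Worked check (a sketch): a 95% confidence interval for a regression coefficient is estimate ± t critical value × standard error, with the residual degrees of freedom from the output (236). The helper name ci_coef is hypothetical and only used for illustration.

```r
# Confidence interval for a coefficient from its estimate and standard error
ci_coef <- function(estimate, se, df, level = 0.95) {
  t_crit <- qt(1 - (1 - level) / 2, df)
  c(lower = estimate - t_crit * se, upper = estimate + t_crit * se)
}

ci_x1 <- ci_coef(0.8215, 0.0427, df = 236)
ci_x3 <- ci_coef(0.0157, 0.0392, df = 236)

round(ci_x1, 4)
round(ci_x3, 4)
```

An interval that contains 0 (as the one for x3 does) is consistent with a non-significant coefficient at the 5% level.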
Question 12:
A company conducted a survey to measure the average satisfaction level of its customers. The mean
satisfaction score in the surveyed population is 65, with a standard deviation of 10. Now, the
company wants to estimate the percentage of customers in the population who are likely to have a
higher satisfaction level than Sarah. Sarah's satisfaction score is 85.
Using this information, what is the estimated percentage of customers in the population who have a
higher satisfaction level than Sarah?
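Worked check (a sketch, assuming satisfaction scores are normally distributed, which the question does not state explicitly): Sarah's score is (85 - 65) / 10 = 2 standard deviations above the mean, and the answer is the upper-tail area beyond that point.

```r
mu <- 65
sigma <- 10
sarah <- 85

z <- (sarah - mu) / sigma  # z-score of Sarah's result

# Share of customers scoring above Sarah under the normality assumption
p_above <- pnorm(sarah, mean = mu, sd = sigma, lower.tail = FALSE)

z
round(100 * p_above, 2)  # as a percentage
```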
Question 13:
Given the following table, calculate the chi square statistic:
x/y   Group 1   Group 2
A     60        25
B     25        15
Question 14:
Call:
lm(formula = salary ~ working_hours, data = exam_data)
Residuals:
Min 1Q Median 3Q Max
-3.210 -1.758 -0.422 1.144 4.688
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.321 1.129 4.717 1.67e-05 ***
working_hours 2.564 0.206 12.459 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
- What is the expected salary of an employee who worked 10 hours?
- Suppose one person who worked 8 hours and had a residual of -2. What is the observed
(real) salary of this employee?
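Worked check (a sketch): the fitted line is salary = intercept + slope × working_hours, and a residual is observed minus predicted, so observed = predicted + residual. The function name predict_salary is hypothetical.

```r
# Coefficients copied from the output above
intercept <- 5.321
slope <- 2.564

predict_salary <- function(hours) intercept + slope * hours

expected_10h <- predict_salary(10)       # predicted salary for 10 hours
observed_8h <- predict_salary(8) + (-2)  # predicted salary for 8 hours plus the residual

expected_10h
observed_8h
```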
Question 15:
- Reference category
- Group 1
- Group 2
a- The value of the intercept (reference category) is:
b- The value of the b-coefficient associated with the dummy of group 1 is
c- The value of the b-coefficient associated with the dummy of group 2 is
d- If x is -1, the predicted y value for people in group 1 is:
Question 16:
In a study conducted on a sample of 200 employees, 45 reported that they are happy with their jobs.
Calculate the 95% confidence interval for the proportion of people in the entire workforce who are
not satisfied with their jobs.
Question 17:
Continuing from the previous question, take the proportion of satisfied employees to be 45 out of
200. The HR department wants to estimate the proportion of employees across the entire
company who are satisfied with their jobs. They plan to conduct a study using a random
sample and need to determine the required sample size. The HR department is comfortable with a
margin of error of 5 percentage points. How large should the sample be to estimate the proportion of
employees who are dissatisfied with their jobs? (Rounding errors will be accepted).
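Worked check (a sketch): the required sample size for a proportion is n = z^2 × p(1 - p) / E^2, rounded up. Because p(1 - p) is the same for p and 1 - p, the answer does not depend on whether the satisfied or the dissatisfied proportion is estimated. The function name sample_size_prop is hypothetical.

```r
# Sample size needed to estimate a proportion within a given margin of error
sample_size_prop <- function(p, margin, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)
  ceiling(z^2 * p * (1 - p) / margin^2)
}

# Prior estimate from the previous question: 45 satisfied out of 200
n_q17 <- sample_size_prop(p = 45 / 200, margin = 0.05)

# With no prior estimate, p = 0.5 gives the most conservative (largest) n
n_worst <- sample_size_prop(p = 0.5, margin = 0.05)

n_q17
n_worst
```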
Question 18:
A chi-square test was conducted to examine the association between smoking status (smoker or non-
smoker) and the occurrence of a respiratory condition (yes or no). The chi-square statistic is 3.28
with 1 degree of freedom.
a- By using R, what is the associated p-value (2 decimals only):
b- Is the association significant at the 95% confidence level (alpha = 0.05)?
Question 19:
- Reference category
- Group 1
- Group 2
- The value of the intercept (reference category) is:
- The value of the b-coefficient associated with the dummy of group 1 is
- The value of the b-coefficient associated with the dummy of group 2 is
- If x is -2, the predicted y value for people in group 1 is:
Question 20: R output
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.8205 0.2914 -2.814 0.005 **
x1 0.5342 0.0335 15.936 < 2e-16 ***
x2 -0.2107 0.0276 -7.640 2.23e-12 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.18 on 238 degrees of freedom
Multiple R-squared: 0.4643, Adjusted R-squared: 0.4599
F-statistic: 103.3 on 2 and 238 DF, p-value: 0.125
a- Create a 95% confidence interval for the variable x1
b- Create a 95% confidence interval for the variable x2
c- Which of the following statements are correct?
a. The F value of this outcome indicates that the model estimated here is correct.
b. In this model, all variables are associated with the dependent variable.
Question 21:
Following the launch of a new coffee blend, a café wants to assess the proportion of customers who
prefer the new blend over the old one. The café plans to conduct a study using a random sample and
aims to determine the necessary sample size. They are comfortable with a margin of error of 1
percentage point. How large should the sample be to estimate the proportion of customers who favor
the new coffee blend? (Rounding errors will be accepted)
Question 22: True or False?
1. A correlation coefficient of -0.90 indicates a strong positive linear relationship between two
variables.
2. A parameter is a numerical characteristic of a sample, whereas a statistic is a numerical
characteristic of a population.
3. R squared represents the proportion of variance in the dependent variable explained by the
independent variables in a regression model.
4. A correlation coefficient of 0.20 suggests a weak positive linear relationship between two
variables.
5. A correlation coefficient of -1.00 implies a perfect negative linear relationship between two
variables.
6. The sampling distribution for proportions is used to make inferences about a population
proportion.
7. Adjusted R squared can be interpreted as the proportion of variance in the dependent
variable explained by the independent variables, accounting for the number of predictors
and sample size.
8. The standard error decreases as the sample size increases.
Question 23:
A researcher conducted a simple linear regression analysis to examine the relationship between
house prices (Var house_price) and one predictor variable: square footage (Var sq_footage). The
researcher obtained the following information:
- Variance of house prices (Var house_price) = 15.8
- Variance of square footage (Var sq_footage) = 20.2
- Variance of residuals (Var residuals) = 8.5
- Variance of predicted values (Var predicted) = 7.3
What is the unadjusted R square of the model?
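Worked check (a sketch): the unadjusted R squared is the share of the variance of the dependent variable that the predicted values account for. Since Var predicted + Var residuals = Var house_price in the figures given, the two formulas below should agree. The variance of square footage is not needed.

```r
var_y <- 15.8     # variance of house prices
var_pred <- 7.3   # variance of predicted values
var_resid <- 8.5  # variance of residuals

r2 <- var_pred / var_y           # explained variance / total variance
r2_alt <- 1 - var_resid / var_y  # 1 - unexplained variance / total variance

round(r2, 3)
```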
Question 24:
A market researcher is examining the relationship between smartphone brands (measured with
three categories: Brand A, Brand B, Brand C) and the preferred feature in a smartphone (7 options
are considered). The researcher employs the chi-square statistic to assess the significance of this
relationship. How many degrees of freedom are associated with the chi-square statistic in this test?
Question 25: Linear equation
- Reference category
- Group 1
- Group 2
- Group 3
a- The value of the intercept (reference category) is:
b- The value of the b-coefficient associated with the dummy of group 1 is
c- The value of the b-coefficient associated with the dummy of group 2 is
d- The value of the b-coefficient associated with the dummy of group 3 is
e- If x is -2, the predicted y value for people in group 3 is:
f- If x is +1, the predicted y value for people in group 2 is:
Open questions:
Code to copy-paste:
library(tidyverse)
set.seed(789)
n <- 329
practice <- tibble(
  gender = sample(c("Male", "Female"), n, replace = TRUE),
  education_level = sample(c("High School", "Bachelor's", "Master's"), n, replace = TRUE),
  age = sample(22:65, n, replace = TRUE),
  employed = sample(c("Employed", "Unemployed"), n, replace = TRUE),
  income = NA_real_, # Numeric placeholder, filled in per education level below
  motivation_before = round(runif(n, 1, 10), 1),
  motivation_after = motivation_before + runif(n, 0, 3),
  mental_health = round(runif(n, 0, 1000), 2), # Overwritten per employment status below
  regular_breaks = sample(c("Yes", "No"), n, replace = TRUE)
) # Closing parenthesis of tibble() was missing in the original
practice$income[practice$education_level == "High School"] <-
round(rnorm(sum(practice$education_level == "High School"), mean = 2800, sd = 1500), 2)
practice$income[practice$education_level == "Bachelor's"] <-
round(rnorm(sum(practice$education_level == "Bachelor's"), mean = 4800, sd = 1500), 2)
practice$income[practice$education_level == "Master's"] <-
round(rnorm(sum(practice$education_level == "Master's"), mean = 5200, sd = 1500), 2)
practice$mental_health[practice$employed == "Employed"] <-
rnorm(sum(practice$employed == "Employed"), mean = 500, sd = 120)
practice$mental_health[practice$employed == "Unemployed"] <-
rnorm(sum(practice$employed == "Unemployed"), mean = 350, sd = 73)
Question 1: A company wants to examine whether there is a significant difference in mental health
scores between employed and unemployed citizens. The mental health scores are measured on a
scale from 0 to 1000, and the company is particularly interested in understanding if there is evidence
of a substantial difference in the average mental health scores for these two groups.
a. Which test should be used? And why?
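Welch t-test, because we compare the mean of a numeric outcome between two independent groups
without assuming equal variances.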
b. Upload a screenshot displaying the relevant output.
c. Upload a screenshot of the commands or steps you would use to perform the statistical
analysis (even if you were unable to execute the analysis).
d. Perform the test and draw conclusions about the relationship between employment status
and mental health.
Using the p-value: since the p-value < 0.05, we reject the null hypothesis and conclude that
there is a significant difference in mean mental health scores between the two employment groups.
Using the sample estimates: the average mental health score is 485.5158 for the employed group
and 340.7864 for the unemployed group.
e. Discuss the confidence interval.
Since 0 is not inside the confidence interval, a difference of zero is not plausible at the 95%
confidence level. We are 95% confident that the mean difference in mental health between the two
groups lies between 124 and 165.
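The commands for this question could look like the sketch below. To keep it self-contained it builds a small stand-in data frame called demo (hypothetical values, not the practice data); with the real data, replace demo with practice. In R, t.test applies the Welch correction by default (var.equal = FALSE).

```r
set.seed(1)

# Stand-in for the practice data: two independent groups with different spreads
demo <- data.frame(
  employed = rep(c("Employed", "Unemployed"), each = 50),
  mental_health = c(rnorm(50, mean = 500, sd = 120),
                    rnorm(50, mean = 350, sd = 73))
)

# Welch two-sample t-test (unequal variances is the default)
res <- t.test(mental_health ~ employed, data = demo)

res$method    # confirms which test was run
res$estimate  # group means
res$conf.int  # 95% CI for the difference in means
res$p.value
```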
Question 2: A wellness center provides various well-being workshops to its clients, including
Mindfulness (Program A), Yoga (Program B), Nutrition (Program C), Fitness (Program D), and Stress
Management (Program E). The center aims to investigate whether the participation of clients in
these workshops aligns with the expected distribution based on their preferences. A random sample
of clients is selected, and their workshop preferences are recorded. The center suspects that the
distribution of workshop preferences in the sample may differ from the expected distribution.
a- Which test is used to answer this question and why?
Chi-square goodness-of-fit test, because I have observed counts for the 5 categories of one
nominal variable and I want to compare them with the expected distribution.
b- Upload a screenshot displaying the relevant output of the test.
c- Upload a screenshot of the commands used to perform the test (even if you were unable to
execute the test).
d- Based on the output, what do you conclude about the suspects of the center?
Using the p-value: since the p-value < 0.05, we reject the null hypothesis and conclude that the
distribution of workshop preferences in the sample differs from the expected distribution.
It is unlikely that the sample comes from a population with the expected distribution.
There is a significant difference between the observed and the expected counts.
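A goodness-of-fit test can be run with chisq.test by passing the observed counts and the expected proportions. The source does not give the workshop counts or the expected distribution, so the numbers below are purely hypothetical placeholders.

```r
# Hypothetical observed counts for Programs A to E (not from the source)
observed <- c(A = 30, B = 25, C = 20, D = 15, E = 10)

# Hypothetical expected distribution: equal preference for all five programs
expected_p <- rep(1 / 5, 5)

# Chi-square goodness-of-fit test; df = number of categories - 1
res <- chisq.test(observed, p = expected_p)

res$statistic
res$parameter  # degrees of freedom
res$p.value
```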
Question 3: A group of researchers is interested in understanding whether there are significant
differences in the income levels of employees (DV) based on their educational backgrounds (IV). It is
believed that people with different educational backgrounds may have varying income levels, and the
researchers want to validate this claim:
a. Which statistical test would you use to explore this and why?
ANOVA (F-test) from lm, because we want to compare mean income among the three education
level groups.
b. Show (copy-paste) the commands or steps you would use to perform the statistical analysis
(even if you were unable to execute the analysis). Also, upload the output.
c. Based on your analysis, provide a conclusion regarding whether there is a significant
difference in income levels.
Since the p-value < 0.05, we reject the null hypothesis and conclude that income is associated with
education level. In other words, there is a significant difference in mean income between at least
two of the education levels.
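A sketch of the commands for parts a to c, using a small stand-in data frame demo with hypothetical values (replace demo with practice to run it on the real data). The F-test comes from fitting the model with lm and passing it to anova.

```r
set.seed(42)

# Stand-in for the practice data: income by three education levels
demo <- data.frame(
  education_level = rep(c("High School", "Bachelor's", "Master's"), each = 40),
  income = c(rnorm(40, mean = 2800, sd = 1500),
             rnorm(40, mean = 4800, sd = 1500),
             rnorm(40, mean = 5200, sd = 1500))
)

# Linear model with education level as a categorical predictor
fit <- lm(income ~ education_level, data = demo)

# ANOVA table: the F-test asks whether at least one group mean differs
aov_table <- anova(fit)
aov_table

f_p <- aov_table[["Pr(>F)"]][1]  # p-value of the overall F-test
```

summary(fit) reports the same F-statistic on its last line, along with the dummy coefficients relative to the reference category.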
A reputable financial journal published a claim that individuals with a Master's degree tend to have
significantly higher incomes compared to individuals with only a Bachelor's degree.
d. Which statistical test would you use to answer this question and why?
Two-sample t-test (one-sided), because we compare the mean income of two groups (individuals with a
Master’s degree and individuals with only a Bachelor’s degree) and the claim is directional.
e. Show (copy-paste) the commands or steps you would use to perform the statistical analysis
(even if you were unable to execute the analysis). Also, upload the output.
Unresolved: the output was NA and I did not know how to proceed.
f. Based on your analysis, provide a conclusion regarding the claim.
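One possible way to set up parts d to f is sketched below. It cannot diagnose what produced the NA mentioned above; it only shows a subset-then-test pattern that avoids depending on factor level order. The data frame demo is a hypothetical stand-in; with the real data, use practice instead.

```r
set.seed(7)

# Stand-in for the practice data, restricted to the two groups of interest
demo <- data.frame(
  education_level = rep(c("Bachelor's", "Master's"), each = 50),
  income = c(rnorm(50, mean = 4800, sd = 1500),
             rnorm(50, mean = 5200, sd = 1500))
)

# Extract the income vectors explicitly; labels must match the data exactly
masters <- demo$income[demo$education_level == "Master's"]
bachelors <- demo$income[demo$education_level == "Bachelor's"]

# Sanity check: both subsets should be non-empty
stopifnot(length(masters) > 0, length(bachelors) > 0)

# One-sided test of H1: mean(Master's) > mean(Bachelor's)
res <- t.test(masters, bachelors, alternative = "greater")

res$estimate
res$p.value
```

If the p-value is below 0.05, the data support the journal's claim; otherwise the claim is not supported by this sample.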
Question 4:
A wellness center is interested in understanding the prevalence of taking regular breaks among its
clients. Regular breaks are believed to have positive effects on both physical and mental well-being.
The center collects data through a survey, where individuals report whether they take regular breaks.
a- Describe shortly which statistical analysis would be appropriate to assess the prevalence of
taking regular breaks
Proportion test.
b- Explain in a few lines why you selected this analysis.
I am interested in the proportion of clients who take regular breaks, which is a single yes/no variable.
c- Upload a screenshot displaying the relevant output
d- Upload a screenshot of the commands or steps you would use to perform the statistical
analysis (even if you were unable to execute the analysis).
e- Based on the output from the dataset
a. estimate the percentage of individuals who take regular breaks. 0.508
b. Provide the lower bound of the confidence interval for the estimated percentage
0.45
c. Provide the upper bound of the confidence interval for the estimated percentage
0.56
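A sketch of the commands, assuming prop.test is the function used. The count of 167 "Yes" answers out of 329 is a hypothetical value chosen only because it is consistent with the reported estimate of 0.508; with the real data the count would come from table(practice$regular_breaks).

```r
n_total <- 329
n_yes <- 167  # hypothetical count, consistent with the reported 0.508

# One-sample proportion test; also returns a 95% confidence interval
res <- prop.test(n_yes, n_total)

res$estimate  # estimated proportion
res$conf.int  # lower and upper bound
```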
Question 5: The management of a company is curious about the impact of a recent wellness initiative
on the overall well-being of its employees. To assess this, the company collected data on the
motivation levels of employees both before and after the implementation of the initiative. The goal is
to determine if there is a statistically significant change in motivation levels.
a. Which test should be used? And why?
Paired-sample t-test (equivalent to a one-sample t-test on the differences), because the same
employees are measured twice, before and after the initiative.
b. Upload a screenshot displaying the relevant output.
c. Upload a screenshot of the commands or steps you would use to perform the statistical
analysis (even if you were unable to execute the analysis).
d. Based on your analysis, provide a conclusion regarding whether there is a significant change
in motivation levels after the initiative.
Since the p-value < 0.05, we reject the null hypothesis and conclude that there is a
significant change in motivation levels after the initiative.
e. Discuss the confidence interval.
Since 0 is not inside the confidence interval, a change of zero is not plausible at the 95%
confidence level. We are 95% confident that the mean difference in motivation between before and
after the implementation of the initiative lies between 1.44 and 1.63.
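A sketch of the commands, using stand-in vectors generated the same way as the motivation columns in the copy-paste code but with a smaller, hypothetical sample. With the real data, use practice$motivation_after and practice$motivation_before.

```r
set.seed(5)

# Stand-in for the two motivation columns (same generating rule, n = 50)
before <- round(runif(50, 1, 10), 1)
after <- before + runif(50, 0, 3)

# Paired t-test: each employee is compared with themselves
res <- t.test(after, before, paired = TRUE)

# Equivalent one-sample t-test on the within-person differences
res_diff <- t.test(after - before)

res$estimate  # mean of the differences
res$conf.int  # 95% CI for the mean difference
res$p.value
```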
Question 6:
A company aims to investigate potential gender-based income disparities within its employed
workforce. To address this, the company gathered data on the income of male and female
employees. The objective is to discern if there is a statistically significant difference in the average
income between male and female employees.
a. Which test should be used? And why?
Two-sample t-test, because I compare the mean income of two independent groups and assume their variances are equal.
b. Upload a screenshot displaying the relevant output.
c. Upload a screenshot of the commands or steps you would use to perform the statistical
analysis (even if you were unable to execute the analysis).
d. Based on your analysis, provide a conclusion regarding whether there is a significant
difference in income levels between genders.
Since the p-value > 0.05, we cannot reject the null hypothesis. There is no evidence of a
significant difference in income levels between genders.
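A sketch of the commands, assuming equal variances as stated in the answer to part a. The data frame demo is a hypothetical stand-in in which both genders share the same income distribution; with the real data, replace demo with practice.

```r
set.seed(3)

# Stand-in for the practice data: income does not depend on gender here
demo <- data.frame(
  gender = rep(c("Male", "Female"), each = 60),
  income = rnorm(120, mean = 4300, sd = 1500)
)

# Classic (pooled-variance) two-sample t-test; var.equal = TRUE turns off Welch
res <- t.test(income ~ gender, data = demo, var.equal = TRUE)

res$method
res$estimate
res$p.value
```

Leaving out var.equal = TRUE gives the Welch version, which is the safer default when the equal-variance assumption has not been checked.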