Mock Exam - Appendix
ANALYSIS IN BUSINESS
WITH R
2024
FINAL EXAM
APPENDIX
Date: May 12, 2024
Time: 90 minutes
Allowed accessories:
- Pen.
- Correction fluid (liquid eraser).
- Calculator.
- "Open book" exam: all printed literature.
Instructions: Sign the Appendix and the Exam Sheet. Clearly mark 1 answer for each
question. There will be no negative points for an incorrectly answered question.
SURNAME:
NAME:
STUDENT ID:
APPENDIX 1
In your master's thesis, you examine Slovenians' attitudes toward protecting the environment and nature.
On the European Social Survey website, you found data on people's agreement with the statement
"Protecting the environment and nature is important," measured on a Likert scale from 1 ("Totally
applies to me") to 6 ("Not at all applies to me"). You are interested in whether there is a relationship
between attitudes toward environmental protection and the person's education. Among other things,
each person also indicated their general well-being on a scale from 1 ("Excellent") to 5 ("Very poor").
Description:
• ID: Person ID
• Educ: Level of education (1:Primary School, 2:High School, 3:University - 1st degree,
4:University - 2nd or 3rd degree).
• Envir: Protecting the environment and nature is important (1:Totally applies to me - 6:Not at all
applies to me).
• Wellbeing: 1:Excellent, 2:Good, 3:So-so, 4:Poor, 5:Very poor.
mydata$EducF <- factor(mydata$Educ,
levels = c(1, 2, 3, 4),
labels = c("Primary", "High", "1st degree",
"2nd or 3rd degree"))
library(psych)
psych::describeBy(mydata$Envir, mydata$EducF)
##
## Descriptive statistics by group
## group: Primary
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 230 2.03 0.95 2 1.91 1.48 1 6 5 1.24 2.12
## se
## X1 0.06
## ----------------------------------------------------
## group: High
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 647 1.85 0.9 2 1.71 1.48 1 6 5 1.7 4.35
## se
## X1 0.04
## ----------------------------------------------------
## group: 1st degree
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 287 1.65 0.84 1 1.49 0 1 5 4 1.71 3.73
## se
## X1 0.05
## ----------------------------------------------------
## group: 2nd or 3rd degree
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 59 1.46 0.6 1 1.39 0 1 3 2 0.88 -0.28
## se
## X1 0.08
psych::describeBy(mydata$Wellbeing, mydata$EducF)
##
## Descriptive statistics by group
## group: Primary
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 230 2.6 1.01 3 2.59 1.48 1 5 4 0.13 -0.6
## se
## X1 0.07
## ----------------------------------------------------
## group: High
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 647 2.24 0.88 2 2.2 1.48 1 5 4 0.43 -0.05
## se
## X1 0.03
## ----------------------------------------------------
## group: 1st degree
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 287 2 0.76 2 1.97 0 1 4 3 0.38 -0.29
## se
## X1 0.04
## ----------------------------------------------------
## group: 2nd or 3rd degree
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 59 1.83 0.7 2 1.8 1.48 1 3 2 0.23 -0.99
## se
## X1 0.09
library(car)
leveneTest(mydata$Envir, mydata$EducF)
## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value Pr(>F)
## group    3  1.2705 0.2831
##       1219
leveneTest(mydata$Wellbeing, mydata$EducF)
## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value    Pr(>F)
## group    3  10.765 5.527e-07 ***
##       1219
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library(dplyr)
library(rstatix)
mydata %>%
group_by(EducF) %>%
shapiro_test(Envir)
## EducF variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 Primary Envir 0.410 0.351
## 2 High Envir 0.345 0.292
## 3 1st degree Envir 0.214 0.115
## 4 2nd or 3rd degree Envir 0.345 0.219
library(dplyr)
library(rstatix)
mydata %>%
group_by(EducF) %>%
shapiro_test(Envir)
## EducF variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 Primary Envir 0.810 4.99e-16
## 2 High Envir 0.745 2.31e-30
## 3 1st degree Envir 0.714 7.63e-22
## 4 2nd or 3rd degree Envir 0.695 8.89e-10
summary(results)
##               Df Sum Sq Mean Sq F value   Pr(>F)
## EducF          3   27.1    9.02   11.57 1.77e-07 ***
## Residuals   1219  950.6    0.78
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library(effectsize)
eta_squared(results)
## For one-way between subjects designs, partial eta squared is
## equivalent to eta squared. Returning eta squared.
## # Effect Size for ANOVA
##
## Parameter | Eta2 | 95% CI
## -------------------------------
## EducF | 0.03 | [0.01, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
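The listing does not show how `results` was created. Given the residual degrees of freedom (1219) and the pairwise tests on Envir that follow, it was presumably something like `aov(Envir ~ EducF, data = mydata)`, but that is an assumption. The sketch below uses only the printed sums of squares to show how the F statistic and eta squared relate to them. Because the printed values are rounded, the results should land close to, not exactly on, the reported 11.57 and 0.03.

```r
# Values copied from the printed ANOVA table (rounded in the output).
ss_between <- 27.1    # Sum Sq for EducF
ss_within  <- 950.6   # Sum Sq for Residuals
df_between <- 3
df_within  <- 1219

# F statistic: ratio of the between-group and within-group mean squares.
f_value <- (ss_between / df_between) / (ss_within / df_within)

# Eta squared: share of the total variability attributable to EducF.
eta2 <- ss_between / (ss_between + ss_within)

# p-value from the upper tail of the F distribution.
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

round(c(F = f_value, eta2 = eta2), 2)
```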
pairwise.t.test(mydata$Envir, mydata$EducF,
p.adj = "none")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: mydata$Envir and mydata$EducF
##
##                   Primary High   1st degree
## High              0.0085  -      -
## 1st degree        1.1e-06 0.0012 -
## 2nd or 3rd degree 9.6e-06 0.0011 0.1316
##
## P value adjustment method: none
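The comparisons above are unadjusted (`p.adj = "none"`). As an illustration only, and not part of the exam output, the sketch below applies two standard corrections from base R's `p.adjust()` to the six printed p-values. The comparison labels are made up here for readability.

```r
# Unadjusted p-values copied from the pairwise.t.test output above.
# The labels are illustrative names, not part of the original output.
p_raw <- c(
  "Primary vs High"                 = 0.0085,
  "Primary vs 1st degree"           = 1.1e-06,
  "Primary vs 2nd or 3rd degree"    = 9.6e-06,
  "High vs 1st degree"              = 0.0012,
  "High vs 2nd or 3rd degree"       = 0.0011,
  "1st degree vs 2nd or 3rd degree" = 0.1316
)

# Bonferroni multiplies each p-value by the number of comparisons (capped at 1).
p_bonf <- p.adjust(p_raw, method = "bonferroni")

# Holm is a step-down procedure whose adjusted p-values never exceed Bonferroni's.
p_holm <- p.adjust(p_raw, method = "holm")

round(cbind(raw = p_raw, bonferroni = p_bonf, holm = p_holm), 4)
```

With `pairwise.t.test()` itself, the same corrections are available through its `p.adjust.method` argument.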
kruskal.test(Wellbeing ~ EducF,
data = mydata)
##
## Kruskal-Wallis rank sum test
##
## data: Wellbeing by EducF
## Kruskal-Wallis chi-squared = 62.847, df = 3, p-value =
## 1.448e-13
APPENDIX 2
A student doing an internship at a real estate agency analyzed, among other things, the market for
apartments for sale in Ljubljana. He took a random sample of 60 apartments and obtained
the following data:
- Age: age of the apartment in years.
- Distance: distance of the apartment from the center of Ljubljana in km.
- PriceM2: Price per m² of apartment (in 1000 EUR).
- Parking: Does the apartment have its own parking place? (0:No, 1:Outdoor parking, 2:Garage)
He would like to investigate how different factors influence the price of an apartment.
Description:
• ID
• Age: age of the apartment in years.
• Distance: distance of the apartment from the center of Ljubljana in km.
• PriceM2: Price per m² of apartment (in 1000 EUR).
• Parking: 0:No, 1:Outdoor parking, 2:Garage.
mydata$ParkingF <- factor(mydata$Parking,
levels = c(0, 1, 2),
labels = c("No", "Outdoor", "Garage"))
library(car)
scatterplot(y = mydata$PriceM2,
x = mydata$Age,
main = "Scatterplot",
smooth = FALSE,
ylab = "Price per m2",
xlab = "Age")
fit1 <- lm(PriceM2 ~ Age,
data = mydata)
summary(fit1)
##
## Call:
## lm(formula = PriceM2 ~ Age, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.6439 -0.2926 -0.0723 0.2384 0.7561
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.215244 0.098579 22.472 <2e-16 ***
## Age -0.009518 0.004729 -2.013 0.0488 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3756 on 58 degrees of freedom
## Multiple R-squared: 0.06528, Adjusted R-squared: 0.04917
## F-statistic: 4.051 on 1 and 58 DF, p-value: 0.0488
anova(fit1)
## Analysis of Variance Table
##
## Response: PriceM2
##           Df Sum Sq Mean Sq F value Pr(>F)
## Age        1 0.5716 0.57155  4.0509 0.0488 *
## Residuals 58 8.1834 0.14109
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
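The quantities reported by `summary(fit1)` can be traced back to the ANOVA table. The sketch below uses only the printed values, so small rounding differences are expected. It shows how R-squared, adjusted R-squared, and the F statistic of the simple regression fit together.

```r
# Values copied from anova(fit1) and summary(fit1).
ss_age   <- 0.5716    # Sum Sq explained by Age
ss_resid <- 8.1834    # residual Sum Sq
n <- 60               # sample size (58 residual df + 2 estimated coefficients)
k <- 1                # number of predictors

# Coefficient of determination: explained share of the total variability.
r2 <- ss_age / (ss_age + ss_resid)

# Adjusted R-squared penalises for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)

# t statistic of the slope: estimate divided by its standard error.
t_age <- -0.009518 / 0.004729

# With a single predictor, the overall F statistic equals the squared t statistic.
f_from_t <- t_age^2

round(c(R2 = r2, adjR2 = adj_r2, t = t_age, F = f_from_t), 4)
```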
library(pastecs)
round(stat.desc(mydata[ , c(2, 3, 4)], basic = FALSE), 2)
## Age Distance PriceM2
## median 18.00 12.00 1.98
## mean 18.15 13.65 2.04
## SE.mean 1.34 1.50 0.05
## CI.mean.0.95 2.67 3.01 0.10
## var 106.94 135.82 0.15
## std.dev 10.34 11.65 0.39
## coef.var 0.57 0.85 0.19
table(mydata$ParkingF)
##
## No Outdoor Garage
## 24 25 11
scatterplotMatrix(mydata[ , c(2, 3, 4)], smooth = FALSE)
library("Hmisc")
rcorr (as.matrix(mydata[ , c(2, 3, 4)]))
##            Age Distance PriceM2
## Age       1.00     0.09   -0.26
## Distance  0.09     1.00   -0.61
## PriceM2  -0.26    -0.61    1.00
##
## n= 60
##
## P
##          Age    Distance PriceM2
## Age             0.5131   0.0488
## Distance 0.5131          0.0000
## PriceM2  0.0488 0.0000
shapiro.test(mydata$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydata$StdResid
## W = 0.95796, p-value = 0.03741
hist(mydata$CooksD,
xlab = "Cook's distance",
ylab = "Frequency",
main = "Histogram of Cook's distances")
head(mydata[order(mydata$StdResid), c("ID", "StdResid")])
## ID StdResid
## 53 53 -3.339
## 28 28 -1.508
## 5 5 -1.480
## 41 41 -1.279
## 36 36 -1.264
## 13 13 -1.025
library(car)
scatterplot(y = mydata$StdResid, x = mydata$StdFittedValues,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
smooth = FALSE,
regLine = FALSE)
library(dplyr)
mydataC <- mydata %>%
filter(!mydata$ID %in% c(53, 38))
vif(fit2)
##              GVIF Df GVIF^(1/(2*Df))
## Age      1.020226  1        1.010062
## Distance 1.314058  1        1.146323
## ParkingF 1.294013  2        1.066558
summary(fit2)
##
## Call:
## lm(formula = PriceM2 ~ Age + Distance + ParkingF, data = mydataC)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.33343 -0.15964 -0.05008 0.14972 0.48003
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.243888 0.089969 24.941 < 2e-16 ***
## Age -0.006758 0.002939 -2.299 0.0255 *
## Distance -0.016835 0.003114 -5.406 1.57e-06 ***
## ParkingFOutdoor 0.151051 0.070478 2.143 0.0367 *
## ParkingFGarage 0.511416 0.094165 5.431 1.43e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2255 on 53 degrees of freedom
## Multiple R-squared: 0.6888, Adjusted R-squared: 0.6653
## F-statistic: 29.32 on 4 and 53 DF, p-value: 7.105e-13
anova(fit1, fit2)
## Analysis of Variance Table
##
## Model 1: PriceM2 ~ Age
## Model 2: PriceM2 ~ Age + Distance + ParkingF
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)
## 1     56 8.0266
## 2     53 2.6941  3    5.3326 34.969 1.321e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
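Model 1 has 56 residual degrees of freedom here, which corresponds to 58 observations. This implies that fit1 was re-estimated on mydataC before the comparison, although the listing does not show that step. The sketch below reproduces the partial F statistic from the printed residual sums of squares, with slight differences due to rounding.

```r
# Values copied from the anova(fit1, fit2) table.
rss_small <- 8.0266   # RSS of PriceM2 ~ Age
rss_large <- 2.6941   # RSS of PriceM2 ~ Age + Distance + ParkingF
df_small  <- 56       # residual df of the smaller model
df_large  <- 53       # residual df of the larger model

# Partial F test: drop in RSS per added parameter, relative to the
# residual variance of the larger model.
df_diff <- df_small - df_large
f_value <- ((rss_small - rss_large) / df_diff) / (rss_large / df_large)

# p-value from the upper tail of the F distribution.
p_value <- pf(f_value, df_diff, df_large, lower.tail = FALSE)

round(f_value, 2)
```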
2. On the basis of the multiple coefficient of determination, we conclude that the model fit2 can
explain __II__ of the variability of apartment prices.
3. Based on the multiple correlation coefficient, we conclude that the linear relationship between
__III__ is strong.
4. Based on the fit2 model, we estimate that a 13-year-old apartment, 10 km from the center of
Ljubljana and with outdoor parking, will cost __IV__ EUR per square meter.
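One way to obtain the inputs for these questions is sketched below, using only the coefficients and R-squared printed by `summary(fit2)`. It assumes the reference category of ParkingF is "No", as the coefficient names suggest. PriceM2 is recorded in 1000 EUR, so the prediction is converted to EUR at the end. The vector names are illustrative.

```r
# Coefficients copied from summary(fit2); PriceM2 is measured in 1000 EUR.
b <- c(intercept = 2.243888, age = -0.006758, distance = -0.016835,
       outdoor = 0.151051, garage = 0.511416)

# Apartment described in question 4: 13 years old, 10 km from the center,
# outdoor parking. Dummy coding: outdoor = 1, garage = 0.
x <- c(intercept = 1, age = 13, distance = 10, outdoor = 1, garage = 0)

# Predicted price: sum of coefficient * value, then converted to EUR.
price_1000_eur <- sum(b * x)
price_eur <- price_1000_eur * 1000

# The multiple correlation coefficient is the square root of R-squared.
r2 <- 0.6888
multiple_r <- sqrt(r2)

c(price_eur = round(price_eur, 2), multiple_R = round(multiple_r, 3))
```

With the fitted model object available, `predict(fit2, newdata = ...)` on a one-row data frame with the same variable names should give the same prediction in 1000 EUR.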
APPENDIX 3
The first midterm exam in Statistics was taken by 324 students in the academic year 2021/2022. The
professor wanted to know whether the result depended on whether the students had completed their
undergraduate studies at the School of Economics and Business in Ljubljana (SEB) or at another
faculty. For this purpose, he classified the students by faculty (SEB or other) and by result (up to
and including 11 points, or more than 11 points).
head(mydata)
##   Result Faculty         ResultF FacultyF
## 1      1       1 Up to 11 points      SEB
## 2      1       1 Up to 11 points      SEB
## 3      1       1 Up to 11 points      SEB
## 4      1       1 Up to 11 points      SEB
## 5      1       1 Up to 11 points      SEB
## 6      1       1 Up to 11 points      SEB
table(mydata$ResultF, mydata$FacultyF)
##
##                       SEB Other
##   Up to 11 points     106    62
##   More than 11 points 105    51
hi_square
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: mydata$ResultF and mydata$FacultyF
## X-squared = 0.4601, df = 1, p-value = 0.4976
addmargins(hi_square$observed)
##                      mydata$FacultyF
## mydata$ResultF        SEB Other Sum
##   Up to 11 points     106    62 168
##   More than 11 points 105    51 156
##   Sum                 211   113 324
addmargins(round(hi_square$expected, 2))
##                      mydata$FacultyF
## mydata$ResultF           SEB  Other Sum
##   Up to 11 points     109.41  58.59 168
##   More than 11 points 101.59  54.41 156
##   Sum                 211.00 113.00 324
library(DescTools)
CramerV(mydata$ResultF, mydata$FacultyF)
## [1] 0.04416433
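The chi-squared output, the expected frequencies, and Cramér's V can all be rebuilt from the four observed counts. The sketch below works from the printed contingency table rather than the raw data, using base R only. The Cramér's V formula is the usual textbook definition, and the variable names are illustrative.

```r
# Observed counts copied from the contingency table above (filled column-wise).
observed <- matrix(
  c(106, 105, 62, 51), nrow = 2,
  dimnames = list(ResultF  = c("Up to 11 points", "More than 11 points"),
                  FacultyF = c("SEB", "Other"))
)

# Expected count under independence: row total * column total / n.
n <- sum(observed)
expected <- outer(rowSums(observed), colSums(observed)) / n

# chisq.test() applies Yates' continuity correction to 2x2 tables by default.
test_yates <- chisq.test(observed)

# Cramer's V is computed from the uncorrected chi-squared statistic.
chi2_raw  <- sum((observed - expected)^2 / expected)
cramers_v <- sqrt(chi2_raw / (n * (min(dim(observed)) - 1)))

round(expected, 2)
round(c(X2_yates = unname(test_yates$statistic), V = cramers_v), 4)
```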
APPENDIX 4
You study how the productivity of people is affected by the organization of work.
head(mydata)
## Gender Age Org1 Org2 Org3 Org4 Org5 Org6 Org7 Org8 Org9 Org10 Org11
## 1 0 19 6 4 4 5 6 4 6 5 5 4 5
## 2 0 21 5 3 3 5 5 5 5 5 5 5 5
## 3 0 21 6 4 2 5 5 5 5 2 3 5 3
## 4 0 21 7 4 4 7 6 7 7 5 6 7 3
## 5 0 21 7 4 5 6 6 6 6 4 5 6 6
## 6 0 22 5 6 5 5 5 5 5 3 4 5 4
##
## Org12 Org13
## 1 6 4
## 2 5 3
## 3 4 3
## 4 7 4
## 5 6 5
## 6 5 4
Description of variables (variables Org1 - Org13 are measured on a Likert scale between 1 and 7, where a
value of 1 means “I fully agree” and a value of 7 means “I strongly disagree”):
• Gender: 0 = F, 1 = M
• Age: Age in years
• Org1: I like to know what to do that day
• Org2: I am not fond of non-organized people
• Org3: I always do everything last minute
• Org4: I like to have well organized documents
• Org5: I act badly in chaotic situations
• Org6: My working space is clean and organized
• Org7: I like to be organized
• Org8: I feel like I am wasting my time
• Org9: I forget my plans
• Org10: I like to work in an organized environment
• Org11: I aimlessly try to do many tasks
• Org12: I have trouble with planning
• Org13: I postpone my tasks to tomorrow
library(pastecs)
round(stat.desc(mydata[ , -16], basic=FALSE), 2)
## Gender Age Org1 Org2 Org3 Org4 Org5 Org6 Org7 Org8
## median 0.00 34.50 5.00 4.00 4.00 6.00 5.00 5.00 6.00 5.00
## mean 0.21 36.55 5.31 4.38 4.04 5.42 4.92 4.56 5.45 4.72
## SE.mean 0.03 0.70 0.09 0.10 0.11 0.09 0.11 0.13 0.09 0.10
## CI.mean.0.95 0.06 1.38 0.17 0.20 0.22 0.18 0.21 0.25 0.18 0.19
## var 0.17 97.27 1.56 2.16 2.40 1.68 2.34 3.17 1.66 1.88
## std.dev 0.41 9.86 1.25 1.47 1.55 1.30 1.53 1.78 1.29 1.37
## coef.var 1.94 0.27 0.24 0.34 0.38 0.24 0.31 0.39 0.24 0.29
## Org9 Org10 Org11 Org12 Org13
## median 5.00 6.00 5.00 5.00 4.00
## mean 4.96 5.36 4.88 4.87 4.04
## SE.mean 0.09 0.09 0.09 0.10 0.09
## CI.mean.0.95 0.17 0.18 0.19 0.19 0.18
## var 1.52 1.72 1.80 1.92 1.70
## std.dev 1.23 1.31 1.34 1.39 1.30
## coef.var 0.25 0.24 0.28 0.28 0.32
library(psych)
describeBy(mydata$Age, mydata$GenderF)
##
## Descriptive statistics by group
## group: F
## vars n mean sd median trimmed mad min max range skew
## X1 1 158 35.58 9.28 34 35.07 10.38 19 59 40 0.44
## kurtosis se
## X1 -0.62 0.74
## ----------------------------------------------------
## group: M
## vars n mean sd median trimmed mad min max range skew
## X1 1 42 40.21 11.17 38.5 39.06 11.12 24 70 46 0.82
## kurtosis se
## X1 0 1.72
library(ggplot2)
ggplot(mydata, aes(y = Org1, x = GenderF)) +
geom_boxplot()