Mock Exam - Appendix

The document examines factors that influence apartment prices in Ljubljana, Slovenia using data on 60 apartments. Linear regression shows that apartment age negatively influences price per square meter. Descriptive statistics are also calculated for variables like age, distance to city center, and price.

APPLIED DATA ANALYSIS IN BUSINESS WITH R
2024
FINAL EXAM

APPENDIX

Date: May 12, 2024
Time: 90 minutes
Allowed accessories:
- Pen.
- Edigs (liquid eraser).
- Calculator.
- "Open book" exam: all printed literature.

Instructions: Sign the Appendix and the Exam Sheet. Clearly mark 1 answer for each
question. There will be no negative points for an incorrectly answered question.

SURNAME:
NAME:
STUDENT ID:

APPENDIX 1
In your master's thesis, you examine Slovenians' attitudes toward protecting the environment and nature.
On the European Social Survey website, you found data on people's agreement with the statement
"Protecting the environment and nature is important," measured on a Likert scale from 1 ("Totally
applies to me") to 6 ("Not at all applies to me"). You are interested in whether there is a relationship
between attitudes toward environmental protection and the person's education. Among other things,
each person also indicated their general well-being on a scale from 1 ("Excellent") to 5 ("Very poor").

mydata <- read.table("./Appendix 1.csv", header=TRUE, sep=";", dec=",")


head(mydata, 10)
## ID Educ Envir Wellbeing
## 1 1 3 3 1
## 2 2 3 2 1
## 3 3 2 1 2
## 4 4 2 1 2
## 5 5 3 1 1
## 6 6 1 1 2
## 7 7 2 1 3
## 8 8 2 1 3
## 9 9 3 1 2
## 10 10 1 1 5

Description:
• ID: Person ID
• Educ: Level of education (1:Primary School, 2:High School, 3:University - 1st degree,
4:University - 2nd or 3rd degree).
• Envir: Protecting the environment and nature is important (1:Totally applies to me - 6:Not at all
applies to me).
• Wellbeing: 1:Excellent, 2:Good, 3:So-so, 4:Poor, 5:Very poor.
mydata$EducF <- factor(mydata$Educ,
                       levels = c(1, 2, 3, 4),
                       labels = c("Primary", "High", "1st degree",
                                  "2nd or 3rd degree"))

library(psych)
psych::describeBy(mydata$Envir, mydata$EducF)
##
## Descriptive statistics by group
## group: Primary
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 230 2.03 0.95 2 1.91 1.48 1 6 5 1.24 2.12
## se
## X1 0.06
## ----------------------------------------------------
## group: High
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 647 1.85 0.9 2 1.71 1.48 1 6 5 1.7 4.35
## se
## X1 0.04
## ----------------------------------------------------
## group: 1st degree
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 287 1.65 0.84 1 1.49 0 1 5 4 1.71 3.73
## se
## X1 0.05

## ----------------------------------------------------
## group: 2nd or 3rd degree
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 59 1.46 0.6 1 1.39 0 1 3 2 0.88 -0.28
## se
## X1 0.08

psych::describeBy(mydata$Wellbeing, mydata$EducF)
##
## Descriptive statistics by group
## group: Primary
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 230 2.6 1.01 3 2.59 1.48 1 5 4 0.13 -0.6
## se
## X1 0.07
## ----------------------------------------------------
## group: High
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 647 2.24 0.88 2 2.2 1.48 1 5 4 0.43 -0.05
## se
## X1 0.03
## ----------------------------------------------------
## group: 1st degree
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 287 2 0.76 2 1.97 0 1 4 3 0.38 -0.29
## se
## X1 0.04
## ----------------------------------------------------
## group: 2nd or 3rd degree
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 59 1.83 0.7 2 1.8 1.48 1 3 2 0.23 -0.99
## se
## X1 0.09

library(car)
leveneTest(mydata$Envir, mydata$EducF)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 3 1.2705 0.2831
## 1219

leveneTest(mydata$Wellbeing, mydata$EducF)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 3 10.765 5.527e-07 ***
## 1219
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

library(dplyr)
library(rstatix)
mydata %>%
  group_by(EducF) %>%
  shapiro_test(Envir)
## EducF variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 Primary Envir 0.410 0.351
## 2 High Envir 0.345 0.292
## 3 1st degree Envir 0.214 0.115
## 4 2nd or 3rd degree Envir 0.345 0.219

library(dplyr)
library(rstatix)
mydata %>%
  group_by(EducF) %>%
  shapiro_test(Envir)
## EducF variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 Primary Envir 0.810 4.99e-16
## 2 High Envir 0.745 2.31e-30
## 3 1st degree Envir 0.714 7.63e-22
## 4 2nd or 3rd degree Envir 0.695 8.89e-10

results <- aov(Envir ~ EducF,
               data = mydata)

summary(results)
## Df Sum Sq Mean Sq F value Pr(>F)
## EducF 3 27.1 9.02 11.57 1.77e-07 ***
## Residuals 1219 950.6 0.78
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

library(effectsize)
eta_squared(results)
## For one-way between subjects designs, partial eta squared is
## equivalent to eta squared. Returning eta squared.
## # Effect Size for ANOVA
##
## Parameter | Eta2 | 95% CI
## -------------------------------
## EducF | 0.03 | [0.01, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
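
The reported eta squared can be cross-checked by hand. A minimal sketch, assuming only the rounded sums of squares printed in the summary(results) table above (the variable names are illustrative):

```r
# Sums of squares copied from the summary(results) output above (rounded values)
ss_effect   <- 27.1    # EducF
ss_residual <- 950.6   # Residuals

# Eta squared = SS_effect / SS_total, where SS_total = SS_effect + SS_residual
eta2 <- ss_effect / (ss_effect + ss_residual)
round(eta2, 2)
```

Because the inputs are rounded, the result agrees with the eta_squared() output only to about two decimals.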

pairwise.t.test(mydata$Envir, mydata$EducF,
                p.adjust.method = "none")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: mydata$Envir and mydata$EducF
##
## Primary High 1st degree
## High 0.0085 - -
## 1st degree 1.1e-06 0.0012 -
## 2nd or 3rd degree 9.6e-06 0.0011 0.1316
##
## P value adjustment method: none
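
The p-values above are unadjusted. As a hedged illustration of what a multiplicity correction would do to them, the sketch below applies a Bonferroni adjustment to the six printed (rounded) p-values; the pair labels are made up for readability and this correction is not part of the original output.

```r
# Unadjusted p-values copied from the pairwise.t.test output above (rounded)
p_raw <- c(
  "Primary vs High"                 = 0.0085,
  "Primary vs 1st degree"           = 1.1e-06,
  "High vs 1st degree"              = 0.0012,
  "Primary vs 2nd or 3rd degree"    = 9.6e-06,
  "High vs 2nd or 3rd degree"       = 0.0011,
  "1st degree vs 2nd or 3rd degree" = 0.1316
)

# Bonferroni: multiply each p-value by the number of comparisons, capped at 1
p_bonf <- p.adjust(p_raw, method = "bonferroni")
p_bonf
```

With the data at hand, the same adjustment could be requested directly through the p.adjust.method argument of pairwise.t.test().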

kruskal.test(Wellbeing ~ EducF,
             data = mydata)
##
## Kruskal-Wallis rank sum test
##
## data: Wellbeing by EducF
## Kruskal-Wallis chi-squared = 62.847, df = 3, p-value =
## 1.448e-13

APPENDIX 2
A student doing an internship in a real estate agency analyzed, among other things, the situation on the
market of apartments for sale in Ljubljana. He sampled 60 randomly selected apartments and obtained
the following data:
- Age: age of the apartment in years.
- Distance: distance of the apartment from the center of Ljubljana in km.
- PriceM2: Price per m² of apartment (in 1000 EUR).
- Parking: Does the apartment have its own parking place? (0:No, 1:Outdoor parking, 2:Garage)
He would like to investigate how different factors influence the price of an apartment.

mydata <- read.table("./Appendix 2.csv", header=TRUE, sep=";", dec=",")


head(mydata)
## ID Age Distance PriceM2 Parking
## 1 1 7 28 1.64 0
## 2 2 18 1 2.80 2
## 3 3 7 28 1.66 0
## 4 4 28 29 1.85 0
## 5 5 18 18 1.64 1
## 6 6 28 12 1.77 0

Description:
• ID
• Age: age of the apartment in years.
• Distance: distance of the apartment from the center of Ljubljana in km.
• PriceM2: Price per m² of apartment (in 1000 EUR).
• Parking: 0:No, 1:Outdoor parking, 2:Garage.
mydata$ParkingF <- factor(mydata$Parking,
                          levels = c(0, 1, 2),
                          labels = c("No", "Outdoor", "Garage"))

library(car)
scatterplot(y = mydata$PriceM2,
            x = mydata$Age,
            main = "Scatterplot",
            smooth = FALSE,
            ylab = "Price per m2",
            xlab = "Age")

fit1 <- lm(PriceM2 ~ Age,
           data = mydata)

summary(fit1)
##
## Call:
## lm(formula = PriceM2 ~ Age, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.6439 -0.2926 -0.0723 0.2384 0.7561
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.215244 0.098579 22.472 <2e-16 ***
## Age -0.009518 0.004729 -2.013 0.0488 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3756 on 58 degrees of freedom
## Multiple R-squared: 0.06528, Adjusted R-squared: 0.04917
## F-statistic: 4.051 on 1 and 58 DF, p-value: 0.0488

anova(fit1)
## Analysis of Variance Table
##
## Response: PriceM2
## Df Sum Sq Mean Sq F value Pr(>F)
## Age 1 0.5716 0.57155 4.0509 0.0488 *
## Residuals 58 8.1834 0.14109
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
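
The summary(fit1) and anova(fit1) outputs describe the same fit, so their numbers can be tied together. A small sketch, assuming only the printed (rounded) values:

```r
# Sums of squares from the anova(fit1) table above
ss_age      <- 0.5716
ss_residual <- 8.1834

# R-squared = explained sum of squares / total sum of squares
r2 <- ss_age / (ss_age + ss_residual)

# With a single predictor, the overall F statistic equals the squared t value of the slope
t_age    <- -2.013
f_from_t <- t_age^2

round(c(r2 = r2, f_from_t = f_from_t), 4)
```

Small differences from the printed 0.06528 and 4.051 come from rounding in the inputs.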

library(pastecs)
round(stat.desc(mydata[ , c(2, 3, 4)], basic = FALSE), 2)
## Age Distance PriceM2
## median 18.00 12.00 1.98
## mean 18.15 13.65 2.04
## SE.mean 1.34 1.50 0.05
## CI.mean.0.95 2.67 3.01 0.10
## var 106.94 135.82 0.15
## std.dev 10.34 11.65 0.39
## coef.var 0.57 0.85 0.19
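
The coef.var row is the standard deviation divided by the mean. A minimal sketch that reproduces it from the rounded values in the table above (the vector names are illustrative):

```r
# Means and standard deviations copied from the stat.desc output above
means <- c(Age = 18.15, Distance = 13.65, PriceM2 = 2.04)
sds   <- c(Age = 10.34, Distance = 11.65, PriceM2 = 0.39)

# Coefficient of variation = std.dev / mean (unit-free, so comparable across variables)
cv <- sds / means
round(cv, 2)
```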

table(mydata$ParkingF)
##
## No Outdoor Garage
## 24 25 11

scatterplotMatrix(mydata[ , c(2, 3, 4)], smooth = FALSE)

library(Hmisc)
rcorr(as.matrix(mydata[ , c(2, 3, 4)]))
## Age Distance PriceM2
## Age 1.00 0.09 -0.26
## Distance 0.09 1.00 -0.61
## PriceM2 -0.26 -0.61 1.00
##
## n= 60
##
## P
## Age Distance PriceM2
## Age 0.5131 0.0488
## Distance 0.5131 0.0000
## PriceM2 0.0488 0.0000
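
With a single predictor, testing the Pearson correlation is equivalent to testing the regression slope, which is why the p-value for Age and PriceM2 (0.0488) matches the one in summary(fit1) on all 60 apartments. A hedged sketch, assuming r is recovered from the Multiple R-squared printed there (0.06528) and given the negative sign of the Age slope:

```r
n  <- 60         # sample size reported by rcorr
r2 <- 0.06528    # Multiple R-squared from the earlier summary(fit1) output
r  <- -sqrt(r2)  # negative because the Age slope is negative

# t statistic for H0: rho = 0, with n - 2 degrees of freedom
t_stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

round(c(r = r, t = t_stat, p = p_value), 4)
```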

fit2 <- lm(PriceM2 ~ Age + Distance + ParkingF,
           data = mydata)

mydata$StdResid <- round(rstandard(fit2),3)

shapiro.test(mydata$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydata$StdResid
## W = 0.95796, p-value = 0.03741

mydata$CooksD <- round(cooks.distance(fit2),3)

hist(mydata$CooksD,
     xlab = "Cook's distance",
     ylab = "Frequency",
     main = "Histogram of Cook's distances")

head(mydata[order(mydata$StdResid), c("ID", "StdResid")])
## ID StdResid
## 53 53 -3.339
## 28 28 -1.508
## 5 5 -1.480
## 41 41 -1.279
## 36 36 -1.264
## 13 13 -1.025

head(mydata[order(-mydata$StdResid), c("ID", "StdResid")])
## ID StdResid
## 27 27 1.966
## 38 38 1.950
## 22 22 1.850
## 10 10 1.547
## 49 49 1.494
## 55 55 1.467

head(mydata[order(-mydata$CooksD), c("ID", "CooksD")])
## ID CooksD
## 53 53 0.270
## 38 38 0.242
## 55 55 0.091
## 22 22 0.089
## 27 27 0.048
## 33 33 0.047

mydata$StdFittedValues <- scale(fit2$fitted.values)

library(car)
scatterplot(y = mydata$StdResid, x = mydata$StdFittedValues,
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            smooth = FALSE,
            regLine = FALSE)

library(dplyr)
mydataC <- mydata %>%
  filter(!ID %in% c(53, 38))

fit1 <- lm(PriceM2 ~ Age,
           data = mydataC)

fit2 <- lm(PriceM2 ~ Age + Distance + ParkingF,
           data = mydataC)

vif(fit2)
## GVIF Df GVIF^(1/(2*Df))
## Age 1.020226 1 1.010062
## Distance 1.314058 1 1.146323
## ParkingF 1.294013 2 1.066558
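
The last column of the vif() output rescales each GVIF by its degrees of freedom, so that a factor with several dummy variables is comparable with single-coefficient predictors. A minimal sketch reproducing that column from the printed values (the vector names are illustrative):

```r
# GVIF values and degrees of freedom copied from the vif(fit2) output above
gvif <- c(Age = 1.020226, Distance = 1.314058, ParkingF = 1.294013)
df   <- c(Age = 1,        Distance = 1,        ParkingF = 2)

# Adjusted value: GVIF^(1 / (2 * Df)); for Df = 1 this is simply sqrt(GVIF)
gvif_adj <- gvif^(1 / (2 * df))
round(gvif_adj, 6)
```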

summary(fit2)
##
## Call:
## lm(formula = PriceM2 ~ Age + Distance + ParkingF, data = mydataC)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.33343 -0.15964 -0.05008 0.14972 0.48003
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.243888 0.089969 24.941 < 2e-16 ***
## Age -0.006758 0.002939 -2.299 0.0255 *
## Distance -0.016835 0.003114 -5.406 1.57e-06 ***
## ParkingFOutdoor 0.151051 0.070478 2.143 0.0367 *
## ParkingFGarage 0.511416 0.094165 5.431 1.43e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2255 on 53 degrees of freedom
## Multiple R-squared: 0.6888, Adjusted R-squared: 0.6653
## F-statistic: 29.32 on 4 and 53 DF, p-value: 7.105e-13
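
The adjusted R-squared follows from the multiple R-squared, the number of observations and the number of estimated slopes. A small sketch, assuming n = 58 (60 apartments minus the two removed units) and k = 4 slope coefficients, as implied by the 53 residual degrees of freedom:

```r
r2 <- 0.6888   # Multiple R-squared from summary(fit2)
n  <- 58       # observations in mydataC
k  <- 4        # slopes: Age, Distance, ParkingFOutdoor, ParkingFGarage

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)
round(adj_r2, 4)
```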

anova(fit1, fit2)
## Analysis of Variance Table
##
## Model 1: PriceM2 ~ Age
## Model 2: PriceM2 ~ Age + Distance + ParkingF
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 56 8.0266
## 2 53 2.6941 3 5.3326 34.969 1.321e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
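
The F statistic in this table compares the residual sums of squares of the two nested models. A minimal sketch that recomputes it from the printed (rounded) values:

```r
# Residual sums of squares and residual degrees of freedom from anova(fit1, fit2)
rss1 <- 8.0266; df1 <- 56   # Model 1: PriceM2 ~ Age
rss2 <- 2.6941; df2 <- 53   # Model 2: PriceM2 ~ Age + Distance + ParkingF

# F = (drop in RSS per added parameter) / (residual variance of the larger model)
f_stat  <- ((rss1 - rss2) / (df1 - df2)) / (rss2 / df2)
p_value <- pf(f_stat, df1 = df1 - df2, df2 = df2, lower.tail = FALSE)

round(f_stat, 2)
```

The result differs from the printed 34.969 only in the third decimal, because the RSS values are rounded.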

Statements related to the R output:

1. We use the function __I__ to test the null hypothesis H0: Δρ² = 0.

2. On the basis of the multiple coefficient of determination, we conclude that the model fit2 can
explain __II__ of the variability in apartment prices.

3. Based on the multiple correlation coefficient, we conclude that the linear relationship between
__III__ is strong.

4. Based on the fit2 model, we estimate that a 13-year-old apartment located 10 km from the center of
Ljubljana, with outdoor parking, will cost __IV__ EUR per square meter.
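
A point prediction of the kind described in statement 4 can be sketched by plugging values into the fitted equation. The sketch below assumes the rounded coefficients printed in summary(fit2) and that PriceM2 is in 1000 EUR, as stated in the variable description; the names are illustrative.

```r
# Coefficients copied from summary(fit2); "No" parking is the reference category
coefs <- c(intercept = 2.243888,
           age       = -0.006758,
           distance  = -0.016835,
           outdoor   = 0.151051)

# Apartment: 13 years old, 10 km from the center, outdoor parking (dummy = 1)
price_k_eur <- coefs[["intercept"]] +
  coefs[["age"]] * 13 +
  coefs[["distance"]] * 10 +
  coefs[["outdoor"]] * 1

price_eur <- price_k_eur * 1000   # convert from 1000 EUR to EUR per m2
round(price_eur, 1)
```

With the fitted object available, predict(fit2, newdata = data.frame(Age = 13, Distance = 10, ParkingF = "Outdoor")) should give the same value in 1000 EUR, up to rounding of the coefficients.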

APPENDIX 3
The first midterm exam in Statistics was written by 324 students in the academic year 2021/2022. The
professor wanted to know whether the result depended on whether the students had completed their
undergraduate studies at the Faculty of Economics in Ljubljana or at another faculty. For this purpose,
he divided the students into two groups according to whether they had studied at the School of
Economics and Business in Ljubljana (SEB) or at another faculty, and into two groups according to
whether they had scored up to (and including) 11 points or more than 11 points.

mydata <- read.table("./Appendix 3.csv", header=TRUE, sep=";", dec=",")

mydata$ResultF <- factor(mydata$Result,
                         levels = c(1, 2),
                         labels = c("Up to 11 points",
                                    "More than 11 points"))

mydata$FacultyF <- factor(mydata$Faculty,
                          levels = c(1, 2),
                          labels = c("SEB", "Other"))

head(mydata)
## Result Faculty ResultF FacultyF
## 1 1 1 Up to 11 points SEB
## 2 1 1 Up to 11 points SEB
## 3 1 1 Up to 11 points SEB
## 4 1 1 Up to 11 points SEB
## 5 1 1 Up to 11 points SEB
## 6 1 1 Up to 11 points SEB

table(mydata$ResultF, mydata$FacultyF)
##
## SEB Other
## Up to 11 points 106 62
## More than 11 points 105 51

hi_square <- chisq.test(x = mydata$ResultF,
                        y = mydata$FacultyF,
                        correct = TRUE)

hi_square
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: mydata$ResultF and mydata$FacultyF
## X-squared = 0.4601, df = 1, p-value = 0.4976

addmargins(hi_square$observed)
## mydata$FacultyF
## mydata$ResultF SEB Other Sum
## Up to 11 points 106 62 168
## More than 11 points 105 51 156
## Sum 211 113 324

addmargins(round(hi_square$expected, 2))
## mydata$FacultyF
## mydata$ResultF SEB Other Sum
## Up to 11 points 109.41 58.59 168
## More than 11 points 101.59 54.41 156
## Sum 211.00 113.00 324
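
The expected frequencies and the test statistic can be reproduced from the observed table alone. A minimal sketch, assuming the four observed counts printed above; each expected count is row total times column total divided by n, and the Yates correction subtracts 0.5 from each absolute deviation:

```r
# Observed counts (filled column by column: SEB first, then Other)
observed <- matrix(c(106, 105, 62, 51), nrow = 2,
                   dimnames = list(Result  = c("Up to 11 points", "More than 11 points"),
                                   Faculty = c("SEB", "Other")))

# Expected counts under independence
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-squared statistic with Yates' continuity correction (2 x 2 table, df = 1)
x2_yates <- sum((abs(observed - expected) - 0.5)^2 / expected)
p_value  <- pchisq(x2_yates, df = 1, lower.tail = FALSE)

round(expected, 2)
round(c(x2 = x2_yates, p = p_value), 4)
```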

library(DescTools)
CramerV(mydata$ResultF, mydata$FacultyF)
## [1] 0.04416433
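
The printed Cramér's V is consistent with the usual formula based on the chi-squared statistic without continuity correction. A hedged sketch under that assumption, using the observed counts from the contingency table above:

```r
# Observed counts (filled column by column: SEB first, then Other)
observed <- matrix(c(106, 105, 62, 51), nrow = 2)
n        <- sum(observed)
expected <- outer(rowSums(observed), colSums(observed)) / n

# Pearson chi-squared WITHOUT Yates' correction
x2 <- sum((observed - expected)^2 / expected)

# Cramer's V = sqrt(chi-squared / (n * (min(rows, cols) - 1)))
v <- sqrt(x2 / (n * (min(dim(observed)) - 1)))
round(v, 8)
```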

APPENDIX 4
You study how the productivity of people is affected by the organization of work.

mydata <- read.table("./Productivity.csv", header=TRUE, sep=";", dec=",")

head(mydata)
## Gender Age Org1 Org2 Org3 Org4 Org5 Org6 Org7 Org8 Org9 Org10 Org11
## 1 0 19 6 4 4 5 6 4 6 5 5 4 5
## 2 0 21 5 3 3 5 5 5 5 5 5 5 5
## 3 0 21 6 4 2 5 5 5 5 2 3 5 3
## 4 0 21 7 4 4 7 6 7 7 5 6 7 3
## 5 0 21 7 4 5 6 6 6 6 4 5 6 6
## 6 0 22 5 6 5 5 5 5 5 3 4 5 4
##
## Org12 Org13
## 1 6 4
## 2 5 3
## 3 4 3
## 4 7 4
## 5 6 5
## 6 5 4

Description of variables (variables Org1 - Org13 are measured on a Likert scale between 1 and 7, where a
value of 1 means “I fully agree” and a value of 7 means “I strongly disagree”):
• Gender: 0 = F, 1 = M
• Age: Age in years
• Org1: I like to know what to do that day
• Org2: I am not fond of non-organized people
• Org3: I always do everything last minute
• Org4: I like to have well organized documents
• Org5: I act badly in chaotic situations
• Org6: My working space is clean and organized
• Org7: I like to be organized
• Org8: I feel like wasting my time
• Org9: I am forgetting my plans
• Org10: I like to work in an organized environment
• Org11: I aimlessly try to do many tasks
• Org12: I have troubles with planning
• Org13: I postpone my tasks to tomorrow

mydata$GenderF <- factor(mydata$Gender,
                         levels = c(0, 1),
                         labels = c("F", "M"))

library(pastecs)
round(stat.desc(mydata[ , -16], basic=FALSE), 2)
## Gender Age Org1 Org2 Org3 Org4 Org5 Org6 Org7 Org8
## median 0.00 34.50 5.00 4.00 4.00 6.00 5.00 5.00 6.00 5.00
## mean 0.21 36.55 5.31 4.38 4.04 5.42 4.92 4.56 5.45 4.72
## SE.mean 0.03 0.70 0.09 0.10 0.11 0.09 0.11 0.13 0.09 0.10
## CI.mean.0.95 0.06 1.38 0.17 0.20 0.22 0.18 0.21 0.25 0.18 0.19
## var 0.17 97.27 1.56 2.16 2.40 1.68 2.34 3.17 1.66 1.88
## std.dev 0.41 9.86 1.25 1.47 1.55 1.30 1.53 1.78 1.29 1.37
## coef.var 1.94 0.27 0.24 0.34 0.38 0.24 0.31 0.39 0.24 0.29

## Org9 Org10 Org11 Org12 Org13
## median 5.00 6.00 5.00 5.00 4.00
## mean 4.96 5.36 4.88 4.87 4.04
## SE.mean 0.09 0.09 0.09 0.10 0.09
## CI.mean.0.95 0.17 0.18 0.19 0.19 0.18
## var 1.52 1.72 1.80 1.92 1.70
## std.dev 1.23 1.31 1.34 1.39 1.30
## coef.var 0.25 0.24 0.28 0.28 0.32

library(psych)
describeBy(mydata$Age, mydata$GenderF)
##
## Descriptive statistics by group
## group: F
## vars n mean sd median trimmed mad min max range skew
## X1 1 158 35.58 9.28 34 35.07 10.38 19 59 40 0.44
## kurtosis se
## X1 -0.62 0.74
## ----------------------------------------------------
## group: M
## vars n mean sd median trimmed mad min max range skew
## X1 1 42 40.21 11.17 38.5 39.06 11.12 24 70 46 0.82
## kurtosis se
## X1 0 1.72
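
The se column is the standard deviation divided by the square root of the group size, and the group means combine into the overall mean Age reported by stat.desc. A small sketch using the rounded values from the output above (the vector names are illustrative):

```r
# Group sizes, means and standard deviations of Age from describeBy
grp_n    <- c(Female = 158,   Male = 42)
grp_mean <- c(Female = 35.58, Male = 40.21)
grp_sd   <- c(Female = 9.28,  Male = 11.17)

# Standard error of the mean within each group
grp_se <- grp_sd / sqrt(grp_n)

# Overall mean as the size-weighted average of the group means
overall_mean <- sum(grp_n * grp_mean) / sum(grp_n)

round(grp_se, 2)
round(overall_mean, 2)
```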

library(ggplot2)
ggplot(mydata, aes(y = Org1, x = GenderF)) +
  geom_boxplot()
