HANOI UNIVERSITY
FACULTY OF MANAGEMENT AND TOURISM
-------oOo------
CASE STUDY
BUSINESS PERFORMANCE
Course: Business and Economics Statistics
Tutorial class: Tut 3 - Group 2
Tutor’s name: Mrs. Trần Thị Thu Hiền
Group members: Trần Thị Mai Hương - 2104000050
Trần Thùy Anh - 2204040013
Nguyễn Hồng Nhung - 2104050037
Nguyễn Phương Thanh - 2104000096
Nguyễn Thùy Linh - 2104000058
Phạm Kỳ Thái - 2204050066
Vũ Huyền Linh - 2004000057
Hanoi, November 3rd 2023
TABLE OF CONTENTS
A. SCENARIO
B. QUESTIONS
C. ANSWER
Question 1: Descriptive statistics for the dataset
Question 2: One-Way ANOVA Testing for significant differences due to province
Question 3. Two-way ANOVA test for province, types of ownership, and interaction.
Question 4: Discuss the credibility of the interpretations and conclusions of these tests. Is
there anything we should be concerned about? Explain.
Question 5: Based on your dataset, make your own problem using simple / multiple linear
regression. Interpret the output.
D. PEER EVALUATION FORM
A. SCENARIO
The Viet Nam Small and Medium Enterprises (SME) database is an important source
of data for scholars doing research on the Vietnamese economy and its micro dynamics. In
2015, the survey was carried out with a sample of over 2500 enterprises from nine
provinces across the country. The survey instrument consists of three modules: (i) a main
enterprise questionnaire for owners or managers; (ii) an employee questionnaire
administered to a random subset of employees in a quarter of randomly selected enterprises;
and (iii) an economic accounts module. In the survey, businesses were asked to:
• Specify address of firm: Hanoi, Haiphong, TP HCM. (province)
• Ownership status: One owner, Multiple owners. (own)
• Quantity produced for the most important product (in revenue terms). (quantityproduct)
• Quantity sold based on quantity produced for the most important product. (quantitysold)
• Total assets in 2014 (end-year) (million VND) (in market value). (totalass)
A portion of the SME data is to be given to each group by your tutor.
B. QUESTIONS
1. Produce descriptive statistics to summarize the data. You are expected to generate as
many relevant descriptive statistics as possible using ALL the relevant tools introduced in
the labs of this course. Remember to provide appropriate interpretations for the descriptive
statistics. Try not to include unnecessary or irrelevant descriptive statistics.
2. Use analysis of variance to test for any significant differences due to province. Use a .05
level of significance, and for now, ignore the effect of types of ownership, quantity produced
and quantity sold. Check all the assumptions of the inference technique you use. Are the
assumptions satisfied? Explain.
3. At the .05 level of significance, test for any significant differences due to province, types
of ownership, and interaction (ignore the effect of quantity produced and quantity sold).
Check all the assumptions of the inference technique you use. Are the assumptions satisfied?
Explain. Draw an interaction plot and interpret the plot. Is the plot consistent with the
conclusions?
4. Discuss the credibility of the interpretations and conclusions of these tests. Is there
anything we should be concerned about? Explain.
5. Based on your dataset, make your own problem using simple/multiple linear regression.
Interpret the output.
C. ANSWER
Question 1: Descriptive statistics for the dataset
In this report, our group uses RStudio to produce the descriptive statistics. First, we set
the working directory and import the CSV file "dataset23.csv" into R, assigning it to
"mydata" for further calculation:
> mydata <- read.table("dataset23.csv", header = TRUE, sep = ",", quote = "\"",
    stringsAsFactors = FALSE)
Because there is a lot of data in this case study, we utilize R's head( ) function to explore
some of the initial observations and learn more about this dataset:
> head(mydata)
The str( ) function can be used to obtain the internal structure of the data. The output and
code are provided below:
> str(mydata)
According to the above output, there are 300 observations of 5 variables: province, own,
quantity produced, quantity sold and total assets.
Several of the graphical and statistical methods used below require categorical variables,
so we convert province and own into factors with the code below:
> mydata$province <- factor(mydata$province, levels = c("Hanoi", "Haiphong", "TP HCM"))
> mydata$own <- factor(mydata$own, levels = c("One-owner", "Multi-owner"))
Then, we utilize the str( ) function to ensure that the province and own variables in our data
file are transformed into factors, resulting in a new structure:
> str(mydata)
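One thing worth checking at this step is that the levels passed to factor() match the spellings in the file exactly, because any value that is not listed is silently turned into NA. The sketch below illustrates the check on a small made-up vector; the values are illustrative and are not taken from the case-study data.

```r
# Toy vector standing in for mydata$province (illustrative values only).
province_raw <- c("Hanoi", "Haiphong", "TP HCM", "Hai Phong", "Hanoi")

# Only values that match a listed level exactly are kept; others become NA.
province_fac <- factor(province_raw, levels = c("Hanoi", "Haiphong", "TP HCM"))

# Count the values lost because their spelling is not one of the levels.
n_unmatched <- sum(is.na(province_fac))
print(table(province_fac, useNA = "ifany"))
print(n_unmatched)
```

If the count is not zero, the spellings in the file and in the levels argument should be reconciled before any group-wise analysis.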
Next, we use the table( ) function to generate a contingency table so that we
can view the sample size of each group:
> table(mydata$own, mydata$province)
The output shows that all 6 treatment groups (3 provinces by 2 types of ownership) have the
same sample size of 50. With two factors, the two-way ANOVA test is the most appropriate
option for examining their effects on total assets.
Following that, the by( ) function is utilized to obtain a number of descriptive statistics for
each group specified by the factors and their respective output, including the mean, median,
and standard deviation:
> by (mydata$totalass,list(mydata$own,mydata$province),mean)
In terms of the mean, TP HCM Multi-owner has the greatest total assets (18888.82), while
the smallest mean belongs to Hanoi One-owner (4491.1).
> by(mydata$totalass,list(mydata$own,mydata$province),median)
Looking at the output above, we can see how the medians vary across the samples. The
median total assets of TP HCM Multi-owner are the highest at 7517.5, while
Hanoi One-owner has the lowest median (1867.5).
> by(mydata$totalass,list(mydata$own,mydata$province),sd)
When it comes to standard deviation, Hai Phong One-owner has the largest standard deviation
of total assets (57826.06), while TP HCM One-owner has the smallest (5207.662).
> by(mydata$totalass,list(mydata$own,mydata$province),summary)
The final call, with summary, returns six basic statistics for each group: the minimum
value, the first quartile, the median, the mean, the third quartile, and the maximum value.
A box plot shows the shape and spread of the data and highlights any outliers, so we use it
to obtain additional information. The dataset has some very large outliers, which compress
the boxes and make them hard to read. For a clearer display, we therefore add ylim to the
boxplot() function. The following code draws the box plot:
> boxplot(totalass ~ interaction(mydata$province, mydata$own), data = mydata,
    xlab = "Province and Owner", ylab = "Total Asset", ylim = c(500, 50000),
    col = c("blue", "yellow", "pink", "red", "orange", "navy"))
The box plot presents several descriptive statistics for the six groups: the median, the
interquartile range, and the minimum and maximum values within the plotted range.
The R output shows that the median values of the six groups barely differ from one another.
Furthermore, the box plot makes the skewness of each group visible. A group can be
positively skewed, negatively skewed or roughly symmetric, depending on how far the median
lies from the first and third quartiles. Comparing the distance from the median to each end
of the box indicates the degree of skewness. All six groups shown in the plot can be
classified as right-skewed, since the median lies closer to the lower end of the box, where
most of the observations are concentrated. TP HCM Multi-owner shows the largest spread,
closely followed by Hai Phong Multi-owner, while Hai Phong One-owner has the narrowest box.
The right skew is consistent with the group means reported above lying well above the
medians, for example 18888.82 against 7517.5 for TP HCM Multi-owner.
We use a mean plot to identify each group's mean value and to compare the means between
groups, using the following code:
> install.packages("gplots")
> library(gplots)
> plotmeans(mydata$totalass ~ interaction(mydata$province, mydata$own), data = mydata,
    xlab = "Province and Type of ownership", ylab = "Total Asset",
    main = "Mean Plot with 95% CI")
The mean plot shows the mean of each of the six groups with its 95% confidence interval. TP
HCM Multi-owner has the highest mean (18888.82), followed by Hai Phong Multi-owner
(15788.84). Hanoi One-owner has the lowest mean (4491.1).
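To make the confidence intervals in the mean plot more concrete, the sketch below computes a t-based 95% confidence interval for one group mean by hand and cross-checks it against t.test(). It assumes that the intervals are t-based, which we understand to be the default behaviour of plotmeans(); the sample is simulated and only stands in for one group of 50 firms.

```r
set.seed(1)
# Simulated right-skewed sample standing in for one group of 50 firms.
x <- rlnorm(50, meanlog = 8, sdlog = 1)

n <- length(x)
se <- sd(x) / sqrt(n)              # standard error of the mean
t_crit <- qt(0.975, df = n - 1)    # two-sided 95% critical value
ci_manual <- c(mean(x) - t_crit * se, mean(x) + t_crit * se)

# t.test() reports the same interval, which gives a cross-check.
ci_ttest <- as.numeric(t.test(x, conf.level = 0.95)$conf.int)
print(ci_manual)
print(ci_ttest)
```

Groups with a large standard deviation have wide intervals, which is why overlapping intervals in the plot do not allow a firm ranking of the group means.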
Question 2: One-Way ANOVA Testing for significant differences due to
province
The choice of one-way ANOVA is appropriate because we have one outcome variable
(totalass) and one factor variable (province) with three levels: Hai Phong, Ha Noi, TP HCM.
However, before conducting the analysis, we need to check the assumptions of the one-way
ANOVA method.
Step 1: Define hypothesis
Ho: There is no significant difference in total assets due to province
Ha: There is a significant difference in total assets due to province
Step 2: Check assumption
1. Independent, simple random samples
Firstly, to examine the samples, we apply the table() function in RStudio. Because we
ignore the effects of type of ownership, quantity produced and quantity sold, the call is:
> table(mydata$province)
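The result shows that each province contributes 100 observations, so all sample sizes
are equal. Equal sample sizes do not by themselves prove independence; that rests on how
the SME survey sampled enterprises, which we take to be random. On this basis, we treat the
samples as independent and randomly selected.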
→ Independent, simple random samples: Satisfied
2. All populations are normally distributed
Secondly, we use the Q-Q plot as a graphical tool for a visual assessment of population
normality. By installing and loading the “car” package first, we can ensure that the
necessary functions and dependencies are available to generate Q-Q plots using the qqPlot()
function:
> install.packages("car")
> library(car)
> qqPlot(lm(totalass ~ province, data = mydata), simulate = TRUE, main = "Q-Q Plot",
    labels = FALSE)
As we can observe from the graph, the majority of data points do not fall along the
straight reference line and there are several outliers, which suggests that the populations
are not normally distributed. To confirm this, we draw a histogram of total assets with the
hist() function:
> hist(mydata$totalass, main = "Histogram", xlab = "Total assets")
The histogram is clearly right-skewed, which leads to the conclusion that the
populations are not normally distributed.
→ All populations are normally distributed: Not satisfied
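The two plots give a visual judgement only. A formal complement, sketched below, is the Shapiro-Wilk test applied to the residuals of the one-way model. The data here are simulated (log-normal total assets for 3 provinces of 100 firms each), so the printed result illustrates the procedure and is not a result from the case-study data.

```r
set.seed(42)
# Simulated stand-in for the case-study data: 3 provinces x 100 firms,
# with log-normal (right-skewed) total assets.
sim <- data.frame(
  province = factor(rep(c("Hanoi", "Haiphong", "TP HCM"), each = 100)),
  totalass = rlnorm(300, meanlog = 8, sdlog = 1.2)
)

fit <- lm(totalass ~ province, data = sim)

# Shapiro-Wilk test on the model residuals; H0: the residuals are normal.
sw <- shapiro.test(residuals(fit))
print(sw)
```

A p-value below 0.05 would lead to rejecting normality, in line with what a strongly right-skewed histogram suggests.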
3. All populations have the same standard deviation
Finally, the function by() is chosen to check the assumption that all populations have the
same standard deviation because it allows us to calculate the standard deviation of the
totalass variable within each level of the province variable in the mydata dataset.
The function call and its output are shown below:
> by(mydata$totalass,mydata$province,sd)
From the output, we compute the ratio between the largest standard deviation and the
smallest standard deviation:
$$\frac{s_{\text{largest}}}{s_{\text{smallest}}} = \frac{44699.1}{7591.148} = 5.888319 > 2$$
The ratio is larger than 2, so the populations do not have the same standard deviation.
Not all of the assumptions of this test are therefore satisfied. However, we still
proceed with one-way ANOVA and treat these unsatisfied assumptions as a limitation of our
project.
→ All populations have the same standard deviation: Not satisfied
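The rule-of-thumb check above can be reproduced in a few lines. The sketch below first redoes the arithmetic with the two standard deviations reported in this section, then defines a small helper that computes the same ratio from raw data; the helper name sd_ratio and its argument names are hypothetical.

```r
# Largest and smallest province standard deviations reported above.
s_largest  <- 44699.1
s_smallest <- 7591.148

ratio <- s_largest / s_smallest
equal_sd_plausible <- ratio <= 2   # rule of thumb used in this report
print(ratio)
print(equal_sd_plausible)

# The same check from raw data: standard deviation per group, then max over min.
# `values` and `groups` are hypothetical argument names.
sd_ratio <- function(values, groups) {
  s <- tapply(values, groups, sd)
  max(s) / min(s)
}
```

With the case-study data loaded, a call such as sd_ratio(mydata$totalass, mydata$province) would be expected to return the same ratio as the by() output.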
Step 3: Test statistic
The R code and output below give the test statistic and the p-value:
> aovQues2 <- aov(totalass ~ province, data = mydata)
> summary(aovQues2)
Step 4: Level of significance: α = 0.05
Step 5: Decision Rule
P-value approach: Reject Ho if p-value < α
According to the R output shown above, p-value = 0.138, which is greater than the level of
significance α = 0.05. As a result, we do not reject Ho and the test is not significant.
Step 6: Conclusion
At the 0.05 level of significance, we do not have enough evidence to conclude that there is
a significant difference in total assets due to province.
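For readers who want to see where the F statistic in the ANOVA table comes from, the sketch below computes it by hand as the between-group mean square divided by the within-group mean square, and checks the result against aov(). The data are simulated (3 groups of 100), so the numbers are not those of the case study.

```r
set.seed(7)
# Simulated stand-in data: 3 groups of 100 observations each.
g <- factor(rep(c("Hanoi", "Haiphong", "TP HCM"), each = 100))
y <- rnorm(300, mean = 10, sd = 2)

k <- nlevels(g)                     # number of groups
n <- length(y)                      # total sample size
group_means <- tapply(y, g, mean)
group_sizes <- tapply(y, g, length)

# Between-group and within-group sums of squares.
ssb <- sum(group_sizes * (group_means - mean(y))^2)
ssw <- sum((y - ave(y, g))^2)       # ave() returns each observation's group mean

f_manual <- (ssb / (k - 1)) / (ssw / (n - k))
p_manual <- pf(f_manual, df1 = k - 1, df2 = n - k, lower.tail = FALSE)

# Cross-check against the table produced by aov().
aov_tab <- summary(aov(y ~ g))[[1]]
print(c(F = f_manual, p = p_manual))
print(aov_tab)
```

The p-value is the upper-tail area of the F distribution with k - 1 and n - k degrees of freedom, which is the quantity compared with α in Step 5.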
Question 3. Two-way ANOVA test for province, types of ownership, and
interaction.
The question asks us to test the effect of 2 factors (province and type of ownership) on
total assets. Therefore, we decided to use the two-way ANOVA method to examine the data.
Step 1: Define hypothesis
Ho1: Province has no effect on total assets
Ha1: Province has an effect on total assets
Ho2: Ownership has no effect on total assets
Ha2: Ownership has an effect on total assets
Ho3: There is no interaction between province and types of ownership
Ha3: There is a significant interaction between province and types of ownership
Step 2: Check assumptions
Assumption 1: Samples are independent simple random samples from each population.
The enterprises are classified by type of ownership and by province, and a firm's type of
ownership has no direct effect on the province it belongs to. As a result, we treat the
samples as independent.
Furthermore, regardless of province or type of ownership, every firm is assumed to have had
an equal probability of being selected into the 300 observations drawn from the six
populations. Therefore, the study appears to be made up of separate simple
random samples. The code below displays the R output, which shows 6 groups of equal size:
> table(mydata$own, mydata$province)
→ Independent, simple random samples: Satisfied
Assumption 2: All populations are normally distributed
To check whether all populations are normally distributed, we use a Q-Q plot in R:
> library(car)
> qqPlot(lm(totalass ~ province + own, data = mydata), simulate = TRUE,
    main = "Q-Q Plot", labels = FALSE)
The points in the plot depart clearly from the reference line, so the data do not follow a
normal distribution. Therefore, this assumption is not satisfied.
→ All populations are normally distributed: Not satisfied
Assumption 3: All populations have the same standard deviation
We check whether all of the standard deviations are equal by using the by() function as
follows:
> by(mydata$totalass, list(mydata$province, mydata$own), sd)
From the R output, we observe that:
$$\frac{\text{Largest standard deviation}}{\text{Smallest standard deviation}} = \frac{57826.06}{5207.662} = 11.10403$$
The ratio is far greater than the rule-of-thumb value of 2 used in Question 2. Moreover,
because the sample size is large, we judge that Levene's test is not suitable for this data
and rely on the ratio instead. We conclude that not all the population standard deviations
are equal.
→ All populations have the same standard deviation: Not satisfied
Step 3: Test statistic and P-value
The R code for running two-way ANOVA test and its output is given below:
> ques3.result<-aov(totalass ~ province*own, data = mydata)
> summary (ques3.result)
Step 4: Level of significance: α = 0.05
Step 5: Decision rule
We use the p-value approach to make the decisions.
Therefore, we reject Ho if the p-value < α.
1. Test for the difference in Total Asset due to Province: p-value = 0.134 > α = 0.05
→ Do not reject Ho1
2. Test for the difference in Total Asset due to Own: p-value = 0.0293 < α = 0.05
→ Reject Ho2
3. Test for the interaction between Province and Own: p-value = 0.2889 > α = 0.05
→ Do not reject Ho3
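The three decisions can be reproduced mechanically by comparing each reported p-value with α. The sketch below uses the p-values quoted in this step; the vector names are ours and are only labels.

```r
alpha <- 0.05
# p-values as reported in the two-way ANOVA output of this report.
p_values <- c(province = 0.134, own = 0.0293, interaction = 0.2889)

# Reject Ho when the p-value is below alpha.
decision <- setNames(
  ifelse(p_values < alpha, "reject Ho", "do not reject Ho"),
  names(p_values)
)
print(decision)
```

Only the ownership effect falls below the 0.05 threshold, which matches the decisions listed above.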
Because the interaction is not significant, we set the interaction effect aside and
interpret the main effect of each factor separately.
Step 6: Conclusion
Conclusion 1: We do not have sufficient statistical evidence to conclude that there are
differences in total assets due to province at the 0.05 significance level.
Conclusion 2: We have sufficient statistical evidence to conclude that there are
differences in total assets due to types of ownership at the 0.05 significance level.
Conclusion 3: We do not have sufficient statistical evidence to conclude that there is a
significant interaction between provinces and types of ownership in total assets at the 0.05
level of significance.
Interaction plot and interpretation
The interaction plot is another way to see whether ownership type and province interact in
their effect on total assets. In general, the multi-owner line lies above the one-owner
line in all three provinces, so multi-owner firms hold more total assets on average than
one-owner firms. The size of the gap varies by province: there is a clear gap between
multi-owner and one-owner firms in Ho Chi Minh City, and the gap in Hanoi differs from that
in the other two provinces. Around Hanoi both lines slope downward and the multi-owner line
remains above the one-owner line. In Hai Phong the difference between the two ownership
types becomes clearer; the red line also rises there, but less steeply than the blue line.
Between Hai Phong and Ho Chi Minh City the two lines are largely parallel, so the plot gives
no evidence of an interaction between province and type of ownership. The persistent gap
between the two lines agrees with Conclusion 2 above: total assets differ by type of
ownership. Because the two lines are not fully parallel over the whole plot, some
interaction may be present, but it is moderate at most. This outcome is consistent with the
conclusion reached in Question 3, where the test at the 0.05 significance level found no
significant interaction.
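The code that produced the interaction plot is not reproduced in this report. A minimal sketch of how such a plot can be drawn with the base R function interaction.plot() is given here. It uses simulated data, and the line colours and axis labels are assumptions rather than the settings used for the plot discussed above.

```r
set.seed(3)
# Simulated stand-in for the case-study data: 6 groups of 50 firms.
sim <- expand.grid(
  rep      = 1:50,
  province = c("Hanoi", "Haiphong", "TP HCM"),
  own      = c("One-owner", "Multi-owner")
)
sim$totalass <- rlnorm(nrow(sim), meanlog = 8, sdlog = 1)

# Cell means: these are the points that the plot connects.
cell_means <- with(sim, tapply(totalass, list(province, own), mean))
print(round(cell_means, 1))

# One line per ownership type across provinces; roughly parallel lines
# suggest little or no interaction.
with(sim, interaction.plot(
  x.factor = province, trace.factor = own, response = totalass,
  fun = mean, type = "b", col = c("red", "blue"),
  xlab = "Province", ylab = "Mean total assets", trace.label = "Ownership"
))
```

Reading the plot next to the table of cell means makes it easier to see whether the gap between the two lines changes from one province to the next.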
Question 4: Discuss the credibility of the interpretations and conclusions of
these tests. Is there anything we should be concerned about? Explain.
Regarding the credibility of the interpretations and conclusions of the tests, in Question 2
and Question 3 we conducted ANOVA tests. However, both tests violate 2 assumptions
(normally distributed populations and equal standard deviations), so the results are
reliable only to a limited extent and the conclusions should be treated with caution.
Moreover, some limitations can be seen in the graphical checks. Because the outliers are
very large, the plots are hard to read and interpret at their full scale. For the box plot
in Question 1, the boxes could be seen clearly only after we narrowed the range of the
y-axis with ylim.
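One way to probe how much the violated assumptions matter, which was not part of the analysis above, is to rerun the province comparison with methods that relax them. The sketch below uses two base R functions: oneway.test() with var.equal = FALSE (Welch's test, which does not assume equal variances) and kruskal.test() (a rank-based test that does not assume normality). The data are simulated, so the p-values shown are illustrative only.

```r
set.seed(11)
# Simulated stand-in data with right skew and unequal spread by province.
sim <- data.frame(
  province = factor(rep(c("Hanoi", "Haiphong", "TP HCM"), each = 100)),
  totalass = c(rlnorm(100, 8, 0.8), rlnorm(100, 8, 1.5), rlnorm(100, 8, 1.0))
)

# Welch's one-way test: does not assume equal variances.
welch <- oneway.test(totalass ~ province, data = sim, var.equal = FALSE)

# Kruskal-Wallis rank test: does not assume normality.
kw <- kruskal.test(totalass ~ province, data = sim)

print(welch$p.value)
print(kw$p.value)
```

If these tests led to the same decisions as the ANOVA on the real data, the conclusions would be more credible; if they disagreed, the ANOVA results would deserve less weight.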
Question 5: Based on your dataset, make your own problem using
simple/multiple linear regression. Interpret the output.
We use simple linear regression to examine the relationship between 2 variables: quantity
produced (quantityproduct) and quantity sold (quantitysold). First, we use the plot()
function to show how the 2 variables relate to each other:
> plot(mydata$quantityproduct, mydata$quantitysold, xlab = "Quantity product", ylab
= "Quantity sold", main = "Relationship between Quantity product and Quantity sold")
Then, we use the lm() function to fit a simple linear regression describing the
relationship between the 2 variables, storing the result in reg1:
> reg1 <- lm(quantityproduct~quantitysold, data = mydata)
> summary (reg1)
From the result, the estimated regression equation for the 2 variables is:
Quantity product = 22.15 + 2.72*Quantity sold
→ y = 22.15 + 2.72x
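As a short worked example of how the fitted equation is used, the sketch below turns it into a function and evaluates it for a few quantities sold. The function name and the input values are illustrative and are not taken from the dataset.

```r
# Coefficients as reported in the regression output above.
intercept <- 22.15
slope <- 2.72

# Fitted quantity produced for a given quantity sold.
predict_quantity_product <- function(quantity_sold) {
  intercept + slope * quantity_sold
}

# Illustrative inputs only; they are not observations from the dataset.
print(predict_quantity_product(c(0, 100, 250)))
```

For 100 units sold, the fitted value is 22.15 + 2.72 × 100 = 294.15, and each additional unit sold raises the fitted quantity produced by 2.72 units.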
To better illustrate the linear regression, we create a scatter plot of the data and add
the fitted regression line. Because reg1 models quantity produced as a function of quantity
sold, quantity sold goes on the x-axis:
> with(mydata, plot(quantitysold, quantityproduct))
> abline(reg1)
To check how well the equation y = 22.15 + 2.72x fits the data, we first test
whether there is a significant relationship between the 2 selected variables:
Ho: There is no relationship
Ha: There is a significant relationship
Since the p-value is smaller than alpha (p-value < 2e-16 < 0.05), we reject Ho and conclude
that there is a significant relationship between the 2 variables.
Then, we check the coefficient of determination to see whether the equation is a good fit.
The Multiple R-squared value is about 99%, meaning that about 99% of the variation in
quantity produced is explained by quantity sold. We conclude that the 2
variables share a strong relationship and that the equation is a good fit.
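The coefficient of determination can also be computed directly from its definition, R² = 1 - SSE/SST. The sketch below does this on simulated data with a strong linear relationship and compares the result with the value reported by summary(); the simulated coefficients are arbitrary and only loosely resemble the fitted equation.

```r
set.seed(5)
# Simulated stand-in data with a strong linear relationship.
x <- runif(300, min = 0, max = 1000)
y <- 20 + 2.7 * x + rnorm(300, sd = 50)

fit <- lm(y ~ x)

# R^2 = 1 - SSE / SST
sse <- sum(residuals(fit)^2)     # unexplained variation
sst <- sum((y - mean(y))^2)      # total variation
r2_manual <- 1 - sse / sst

print(r2_manual)
print(summary(fit)$r.squared)
```

The closer SSE is to zero relative to SST, the closer R² is to 1.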
Next, we check the coefficient of correlation to see whether there is a linear relationship.
We perform the correlation test cor.test() with the hypotheses:
Ho: ρ = 0
Ha: ρ is different from 0
> cor.test(mydata$quantitysold, mydata$quantityproduct, alternative = "two.sided",
    method = "pearson", conf.level = 0.95)
From the result, since the p-value is smaller than alpha (p-value < 2.2e-16 < 0.05), we
reject Ho and conclude that the correlation between the 2 variables is significantly
different from zero, which supports a linear relationship.
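The correlation test and the regression are closely linked: in simple linear regression the squared Pearson correlation equals R². The sketch below shows this on simulated data standing in for quantity sold (x) and quantity produced (y); the numbers are not those of the case study.

```r
set.seed(9)
# Simulated stand-in for quantity sold (x) and quantity produced (y).
x <- runif(300, min = 0, max = 1000)
y <- 20 + 2.7 * x + rnorm(300, sd = 50)

# Pearson correlation test; H0: rho = 0.
ct <- cor.test(x, y, alternative = "two.sided", method = "pearson",
               conf.level = 0.95)
r <- unname(ct$estimate)

# In simple linear regression the squared correlation equals R^2.
r2_from_lm <- summary(lm(y ~ x))$r.squared
print(c(r = r, r_squared = r^2, R2 = r2_from_lm, p_value = ct$p.value))
```

The test concerns whether ρ differs from zero; whether a straight line is an adequate description is better judged from the scatter plot.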
In conclusion, the linear regression equation can help us predict and estimate the
expected quantity produced from the quantity sold with a relatively high degree of
confidence.
D. PEER EVALUATION FORM
Team member             Contribution (100%)    Signature (all members)
Trần Thị Mai Hương      100%
Trần Thùy Anh           100%
Nguyễn Hồng Nhung       100%
Nguyễn Thùy Linh        100%
Nguyễn Phương Thanh     100%
Phạm Kỳ Thái            100%
Vũ Huyền Linh           100%