
Example Report

This document describes using descriptive statistics and analysis of variance to analyze a dataset. Various descriptive statistics are calculated for variables by group including mean, median, standard deviation, and through boxplots. Normality and homogeneity of variance are also checked before conducting one-way ANOVA tests to determine significance of differences between groups.

Uploaded by

Trần Thảo

Question 1: Produce descriptive statistics to summarize the data. You are expected to generate as
many relevant descriptive statistics as possible using ALL the relevant tools introduced in the labs
of this course. Remember to provide appropriate interpretations for the descriptive statistics. Try
not to include unnecessary or irrelevant descriptive statistics.

We compute the descriptive statistics in RStudio. First, we import the data file into R for
further analysis (the command below reads it as the comma-separated file "Dataset4.csv"):

➢ Dataset4 <- read.table("Dataset4.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)

The case study contains 180 observations, so we inspect the first few rows with the head()
function and the overall structure with the str() function to get a better feel for the data:

Figure 1: Some first observations of the data set


From the previous output, we can see that there are 180 observations of 3 variables: roa,
own, and province. Because own is stored as character and province as an integer code, we
convert both into factors. The levels passed to factor() must match the values stored in the data
exactly, otherwise the converted column contains only NA:

➢ Dataset4$own <- factor(Dataset4$own, levels = c("state-owned", "private-owned"))

➢ Dataset4$province <- factor(Dataset4$province, levels = c(1, 2, 3), labels = c("HaNoi", "DaNang", "HoChiMinh"))

After that, we use the str() function again to obtain the new structure of the data with “own” and
“province” converted into factors:
➢ str(Dataset4)

Figure 2: Structure of the data when factors have been converted
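
The conversion to factors is easy to get wrong when the supplied levels do not match the stored values. The following is a minimal sketch on an invented four-row data frame (the object toy, its values, and the labels P1 to P3 are assumptions for illustration, not the report's data):

```r
# Hypothetical miniature data frame; all values are invented for illustration.
toy <- data.frame(
  roa      = c(0.010, 0.020, 0.015, 0.030),
  own      = c("state-owned", "private-owned", "state-owned", "private-owned"),
  province = c(1L, 2L, 3L, 1L),
  stringsAsFactors = FALSE
)

# Levels that match the stored strings give a usable factor.
toy$own_ok <- factor(toy$own, levels = c("state-owned", "private-owned"))

# Levels that match nothing silently turn every value into NA.
toy$own_bad <- factor(toy$own, levels = c("State", "Private"))

# Integer codes can be relabelled while converting.
toy$province_f <- factor(toy$province, levels = c(1, 2, 3),
                         labels = c("P1", "P2", "P3"))

str(toy)
n_na_bad <- sum(is.na(toy$own_bad))  # number of values that matched no level
```

Checking the result with str() or table() right after the conversion catches a mismatch before it propagates into later analyses.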


The following table() function: tableName <- table(row variable, column variable) can be used
to generate a frequency table to determine the sample size of each treatment group:

➢ table(Dataset4$own,Dataset4$province)

Figure 3: Frequency table of sample size

The table shows that all six treatment groups have the same sample size of 30. Such a balanced
design is the most favourable case for a two-way ANOVA test. Next, we use the by() function to
obtain descriptive statistics of roa (mean, standard deviation, and a summary that includes the
median and quartiles) for each treatment group defined by the two factors:

➢ by(Dataset4$roa,list(Dataset4$own,Dataset4$province),mean)

Figure 5: Mean of the data set


➢ by(Dataset4$roa,list(Dataset4$own,Dataset4$province),sd)

Figure 6: Standard deviation of the data set

➢ by(Dataset4$roa,list(Dataset4$own,Dataset4$province),summary)
Figure 7: Summary of the data set

Each command returns one kind of descriptive statistic of the outcome variable roa for every
treatment group defined by own and province. The summary() output in particular reports six
statistics per group: minimum, 1st quartile, median, mean, 3rd quartile, and maximum.
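
The frequency table and the per-group statistics can also be collected into compact matrices. The sketch below uses synthetic data (the object toy and its simulated roa values are assumptions, not the report's dataset) and tapply() as an alternative to by():

```r
set.seed(1)  # synthetic data only; not the report's dataset

# 30 hypothetical firms in each of the 2 x 3 ownership/province cells.
toy <- expand.grid(rep = 1:30,
                   own = c("state-owned", "private-owned"),
                   province = c("P1", "P2", "P3"))
toy$roa <- rnorm(nrow(toy), mean = 0.03, sd = 0.05)

# Cell counts; a balanced design has one common count in every cell.
counts <- table(toy$own, toy$province)
balanced <- length(unique(as.vector(counts))) == 1

# Cell means and standard deviations as 2 x 3 matrices.
cell_mean <- tapply(toy$roa, list(toy$own, toy$province), mean)
cell_sd   <- tapply(toy$roa, list(toy$own, toy$province), sd)

print(counts)
print(round(cell_mean, 4))
print(round(cell_sd, 4))
```

The matrix layout makes it easy to compare groups side by side, whereas by() prints one block per group.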

Next, we draw a boxplot and a mean plot to examine the findings more closely:

➢ boxplot(roa ~ interaction(own, province), data = Dataset4, xlab = "Ownership and Province", ylab = "ROA", col = c("red", "blue", "yellow", "pink", "gray", "purple"))

Figure 8 : Boxplot
The box plots show the distribution of ROA in each of the six groups: the minimum and
maximum values, the quartiles, the median (the black horizontal line in each box), and any
outliers. The boxes differ in both shape and position across the groups. State-owned firms in
Ho Chi Minh City have the highest median ROA, whereas privately owned firms in Ho Chi Minh
City have the lowest median ROA.

The mean plot can be used to determine the mean value as well as the mean comparison between
treatment groups:

➢ install.packages("gplots")

➢ library(gplots)

➢ plotmeans(roa ~ interaction(own, province), data = Dataset4, xlab = "Ownership and Province", ylab = "ROA", main = "Mean Plot with 95% CI")

Figure 9 : Mean plot with 95% CI

Each of the six groups in the mean plot is drawn with a 95% confidence interval, and the plotted
means are the same values returned by the by() call for means above. The figure suggests a
noticeable gap between the mean ROA of firms in Ho Chi Minh City and those in the other two cities.
Although the values of all categories range from 0.01 to 0.05, Ho Chi Minh's privately owned
enterprises have the lowest mean value in the sample. The six groups have visibly different
sample means. Whether these differences are statistically significant, and whether the ANOVA
assumptions hold, is examined in the following questions.
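
A 95% confidence interval for a single group mean can be reproduced by hand. The sketch below assumes the interval is the usual t-based one computed from the group's own standard deviation (an assumption about how the plot is drawn) and uses one simulated group of 30 values rather than the report's data:

```r
set.seed(2)  # synthetic data only

# One hypothetical treatment group of 30 firms.
x <- rnorm(30, mean = 0.03, sd = 0.05)
n <- length(x)

# Half-width of the 95% interval: t quantile times the standard error.
half_width <- qt(0.975, df = n - 1) * sd(x) / sqrt(n)
ci <- c(lower = mean(x) - half_width, upper = mean(x) + half_width)

# Cross-check against the interval reported by t.test().
ci_ttest <- t.test(x)$conf.int

print(ci)
print(ci_ttest)
```

Groups whose intervals barely overlap or do not overlap at all are the ones most likely to differ in a formal test, but the plot alone does not replace the ANOVA.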

Question 2: Use analysis of variance to test for any significant differences due to province. Use a
.05 level of significance, and for now, ignore the effect of types of ownership. Check all the
assumptions of the inference technique you use. Are the assumptions satisfied? Explain.

Step 1: Hypothesis

- Ho: All population means are equal (μ1 = μ2 = ... = μk)

- Ha: At least two population means are different

Step 2: Test statistics

Before conducting a one-way ANOVA, we must check to ensure that three assumptions are met.

Assumption 1: Samples are independent, simple random samples

First, we divide the total sample into three province groups that do not influence one
another. Furthermore, the firms were recorded separately. As a result, we treat the samples
as independent and randomly selected.

Assumption 2: All populations in question are normally distributed

We use the Q-Q plot with R command to check the second assumption: All populations in
question are normally distributed.

➢ question2 <- read.table("Datasets.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)

➢ str(question2)
➢ question2$province <- factor(question2$province, levels = c("1", "2", "3"), labels = c("HaNoi", "DaNang", "HoChiMinh"))

➢ question2$province

➢ library(car)   # qqPlot() is provided by the car package

➢ qqPlot(lm(roa ~ province, data = question2), simulate = TRUE, main = "Q-Q Plot", labels = FALSE)

With a sample size of 180, the normality of the residuals can be assessed with a normal Q-Q plot,
which compares the residuals against a perfectly normal distribution. We can see from the plot
that almost all points lie close to the straight line. Therefore, we treat the assumption that the
populations are normally distributed as reasonably satisfied.
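
The same residual check can be sketched with base R only. The example below uses synthetic data (the object toy is an assumption, not the report's dataset) and summarises how closely the Q-Q points follow a straight line with a correlation coefficient, which is only an informal aid to reading the plot:

```r
set.seed(3)  # synthetic data only

# Hypothetical data: 60 firms in each of three provinces.
toy <- data.frame(
  province = factor(rep(c("P1", "P2", "P3"), each = 60)),
  roa      = rnorm(180, mean = 0.03, sd = 0.05)
)

# Residuals of the one-way model, i.e. deviations from the group means.
res <- residuals(lm(roa ~ province, data = toy))

# Q-Q coordinates without drawing; x = theoretical, y = sample quantiles.
qq <- qqnorm(res, plot.it = FALSE)
qq_cor <- cor(qq$x, qq$y)  # close to 1 when the points follow a line

# To draw the plot itself:
# qqnorm(res); qqline(res)
print(qq_cor)
```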

Assumption 3: All populations have the same standard deviation

⮚by(dataset4$roa, dataset4$province, sd)

dataset4$province: Hanoi

[1] 0.06334152
dataset4$province: Danang

[1] 0.08371056

dataset4$province: Hochiminh

[1] 0.2478101
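
The three standard deviations printed above can be compared with a quick ratio. This is a minimal sketch that simply re-enters the reported values. The threshold of 2 is the rule of thumb used in this report:

```r
# Standard deviations of roa by province, copied from the output above.
sds <- c(Hanoi = 0.06334152, Danang = 0.08371056, Hochiminh = 0.2478101)

# Rule of thumb: a ratio above 2 casts doubt on equal standard deviations.
ratio <- max(sds) / min(sds)
needs_formal_test <- ratio > 2

print(ratio)
print(needs_formal_test)
```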

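The ratio Largest SD / Smallest SD = 0.2478101 / 0.06334152 = 3.912285 exceeds the rule-of-thumb
limit of 2, so equal standard deviations cannot be taken for granted and we check the third
assumption formally using Levene’s test. This test has the following hypotheses: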

- Ho: The variance among the groups is equal.


- Ha: The variance among the groups is not equal.

If the p-value from the test is less than our chosen significance level 𝛼= .05, we can reject the
null hypothesis and conclude that we have enough evidence to state that the variance among the
groups is not equal. In R, this test can be performed thanks to the leveneTest() function from the
{car} package we have installed earlier.

⮚ leveneTest(question2$roa, question2$province, center = median)

Levene's Test for Homogeneity of Variance (center = median)

       Df F value  Pr(>F)
group   2  5.8767 0.00338 **
      177

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The p-value of the test is 0.00338, which is smaller than our significance level of 0.05, so we
reject Ho: the equal-variance assumption is not satisfied. For the purpose of this exercise we
nevertheless proceed as if the variances were equal, so the ANOVA result below should be
interpreted with this limitation in mind.
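
Levene's test with center = median can be understood as a one-way ANOVA on the absolute deviations of each observation from its own group median. The sketch below illustrates that idea with base R on synthetic data (the object toy and the deliberately unequal spreads are assumptions, not the report's dataset):

```r
set.seed(4)  # synthetic data only

# Hypothetical data: the third group is given a much larger spread.
toy <- data.frame(
  province = factor(rep(c("P1", "P2", "P3"), each = 60)),
  roa      = c(rnorm(60, 0.03, 0.05),
               rnorm(60, 0.03, 0.05),
               rnorm(60, 0.03, 0.25))
)

# Absolute deviation of every observation from its own group's median.
toy$absdev <- abs(toy$roa - ave(toy$roa, toy$province, FUN = median))

# One-way ANOVA on those deviations gives the Levene F statistic.
lev <- anova(lm(absdev ~ province, data = toy))
lev_F <- lev[["F value"]][1]
lev_p <- lev[["Pr(>F)"]][1]

print(lev)
```

A small p-value here means the typical distance from the group median differs between groups, which is the sense in which the variances are unequal.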

The ANOVA test

Having checked the three assumptions (with the caveat on equal variances noted above), we now
run the ANOVA test. The ANOVA command is as follows:

⮚ aov1 <- aov(roa ~ province, data = question2)

⮚ summary(aov1)

             Df Sum Sq Mean Sq F value Pr(>F)
province      2  0.166 0.08303   3.439 0.0343 *
Residuals   177  4.273 0.02414

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Step 3: Level of significance

The level of significance: α=0.05

Step 4: Decision rule


We will reject Ho if p-value ≤ α.

Step 5: Value of test statistic

For a one-way ANOVA, the null hypothesis is rejected if the p-value is smaller than alpha.
From summary(aov1), the F statistic is 3.439 with a p-value of 0.0343, which is
smaller than alpha = 0.05. Therefore, we decide to reject the null hypothesis.
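
The F statistic and p-value can be checked against the mean squares and degrees of freedom reported in the ANOVA table for province. This is a minimal sketch that re-enters the rounded table values, so the results agree with the table only up to rounding:

```r
# Values copied from the one-way ANOVA table for province.
ms_province <- 0.08303   # mean square between provinces
ms_resid    <- 0.02414   # residual mean square
df1 <- 2                 # k - 1 with k = 3 provinces
df2 <- 177               # n - k with n = 180 observations

# F is the ratio of the two mean squares.
F_value <- ms_province / ms_resid

# Upper-tail probability of the F distribution gives the p-value.
p_value <- pf(F_value, df1, df2, lower.tail = FALSE)
reject_h0 <- p_value <= 0.05

print(c(F = F_value, p = p_value))
```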

Step 6: Conclusion
A one-way analysis of variance was performed to compare the effect of province on the
profitability of a business. Based on the result in Step 5, we have enough evidence to conclude
that mean profitability differs significantly between at least two provinces.

Question 3: Use analysis of variance to test for any significant differences due to types of
ownership. Use a .05 level of significance, and for now, ignore the effect of province. Check all
the assumptions of the inference technique you use. Are the assumptions satisfied? Explain.

Step 1: Hypothesis
- Ho: All population means are equal (μ1 = μ2 = ... = μk)
- Ha: At least two population means are different
Step 2: Assumptions
Based on the insights and knowledge we learned in the BES course, the One-Way Analysis of
Variance is the best inference approach to resolving this question in our case study. Before we
can conduct a one-way ANOVA, we must first check to make sure that three assumptions are
met.
1. Samples are independent, simple random samples of data from the population.
2. The dependent variables for each group are normally distributed.
3. The variances of the populations that the samples come from are equal.
The first assumption can only be satisfied if a randomized design is carried out. As we know, this
survey was conducted on more than 2 million enterprises in all regions of the country in 2004.
The questionnaire contains many parts, each relating to a different aspect of business.
From that, we conclude that the first assumption is valid.
To check the second assumption, we use Q-Q plots. First, we import the CSV data file into
RStudio and assign it to question3:
⮚ question3 <- read.table("datasets.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)
⮚ str(question3)
'data.frame': 180 obs. of 3 variables:
 $ roa     : num 9.75e-03 4.36e-05 1.81e-03 9.01e-04 7.36e-03 ...
 $ own     : chr "state-owned" "state-owned" "state-owned" "state-owned" ...
 $ province: int 1 1 1 1 1 1 1 1 1 1 ...
With the Q-Q plot, we can check the normality of the data graphically. A separate Q-Q plot
could be generated for each sample to evaluate whether each one is normally distributed.
As an alternative, we plot the residuals of the model to assess their normality. The qqPlot()
function comes from the {car} package, which must be installed and loaded first.
⮚ install.packages("car")
⮚ library(car)
⮚ qqPlot(lm(roa ~ own, data = question3), simulate = TRUE, labels = FALSE)
Below is the graphical output after we run the code in R:

In a Q-Q plot, if the data points align along a straight diagonal line, the data appear to follow
a normal distribution. Apart from a few minor deviations along each of the tails, the points lie
mainly along the straight diagonal line, so we treat the normality assumption as reasonable
based on the plot.
We check the third assumption using Levene’s test because Largest SD / Smallest SD =
0.2049527 / 0.08038714 = 2.549571, which is greater than 2.
⮚ by(Dataset4$roa, Dataset4$own, sd)
Dataset4$own: state-owned
[1] 0.08038714

Dataset4$own: private-owned
[1] 0.2049527
This test has the following hypotheses:
- Ho: The variance among the groups is equal.
- Ha: The variance among the groups is not equal.
If the p-value from the test is less than our chosen significance level 𝛼= .05, we can reject the
null hypothesis and conclude that we have enough evidence to state that the variance among the
groups is not equal. In R, this test can be performed thanks to the leveneTest() function from the
{car} package we have installed earlier.
⮚ leveneTest(roa ~ own, question3)
The output is below:
Levene's Test for Homogeneity of Variance (center = median)
       Df F value Pr(>F)
group   1  1.2044 0.2739
      178
The p-value of the test is 0.2739, which is higher than our significance level of 0.05. We
therefore do not have enough evidence to reject the null hypothesis of equal variances, and we
treat the third assumption as reasonable.
The ANOVA test
After all three assumptions are satisfied, we now run the ANOVA test. The ANOVA command is
as follows:
⮚ aov(roa~own, data = question3)
where ‘roa’ and ‘own’ are the dependent and independent variables. The final argument is the
name of the data structure being analyzed.
Call:
   aov(formula = roa ~ own, data = question3)

Terms:
                     own Residuals
Sum of Squares  0.125775  4.313627
Deg. of Freedom        1       178

Residual standard error: 0.1556723
Estimated effects may be unbalanced
We then run the one-way ANOVA with roa as the outcome variable and own as the factor, storing
the fit so that the full ANOVA table can be displayed with the summary() command.
⮚ aov1 <- aov(roa~own, data=question3)
⮚ summary(aov1)
Df Sum Sq Mean Sq F value Pr(>F)
own 1 0.126 0.12577 5.19 0.0239 *
Residuals 178 4.314 0.02423
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Step 3: Level of significance

The level of significance: α=0.05

Step 4: Decision rule


We will reject Ho if p-value ≤ α.

Step 5: Value of test statistic

The p-value is 0.023898. We can confirm the p-value = 0.023898 < 𝛼 =0.05 and state that the
null hypothesis is rejected.
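
With only two groups, the one-way ANOVA is equivalent to a pooled-variance two-sample t-test: the F statistic equals the square of the t statistic and the p-values coincide. The sketch below demonstrates this on synthetic data (the object toy and its simulated values are assumptions, not the report's dataset):

```r
set.seed(5)  # synthetic data only

# Hypothetical data: 90 firms per ownership type.
toy <- data.frame(
  own = factor(rep(c("state-owned", "private-owned"), each = 90)),
  roa = c(rnorm(90, 0.04, 0.08), rnorm(90, 0.01, 0.08))
)

# One-way ANOVA with a two-level factor.
fit <- aov(roa ~ own, data = toy)
anova_tab <- summary(fit)[[1]]
F_value <- anova_tab[["F value"]][1]
p_anova <- anova_tab[["Pr(>F)"]][1]

# Pooled-variance t-test on the same data.
tt <- t.test(roa ~ own, data = toy, var.equal = TRUE)
t_sq <- unname(tt$statistic)^2

print(c(F = F_value, t_squared = t_sq))
```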

Step 6: Conclusion
A one-way analysis of variance was performed to compare the effect of type of ownership on
the profitability of businesses. It revealed a statistically significant difference in mean
roa between the two groups (F(1, 178) = 5.1906, p = 0.023898). There is enough evidence to
conclude that the profitability of businesses differs significantly by type of
ownership.
Question 4: At the .05 level of significance test for any significant differences due to province,
types of ownership, and interaction. Check all the assumptions of the inference technique you
use. Are the assumptions satisfied? Explain.

I- Assumptions:
(1) Samples are independent, simple random samples of size n
(2) All populations have the same standard deviation
(3) All populations are normally distributed
II- Checking process
(1) Samples are independent, simple random samples of size n

(2) All populations have the same standard deviation

⮚ by(Dataset$roa,list(Dataset$own,Dataset$province), sd)

own            province    sd
private-owned  province 1  0.01813046
state-owned    province 1  0.08496648
private-owned  province 2  0.1124882
state-owned    province 2  0.03961233
private-owned  province 3  0.323886
state-owned    province 3  0.1048471

Largest SD / Smallest SD = 0.323886 / 0.01813046 = 17.86419098, which is far greater than 2, so it
is not safe to pool the variances without a formal check. We therefore use Levene’s test:

⮚ leveneTest(Dataset$roa,interaction(Dataset$own,Dataset$province),center=median)

Levene's Test for Homogeneity of Variance (center = median)

       Df F value   Pr(>F)
group   5  3.2052 0.008563 **
      174
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The p-value is 0.008563 < 0.05, so we reject Ho: the standard deviations are not equal and the
second assumption is not satisfied. For the purpose of this exercise we nevertheless proceed as if
the variances were equal, so the results below should be interpreted with this limitation in mind.
(3) All populations are normally distributed

- The points lie mostly along the straight diagonal line, with some minor deviations along each of
the tails, so we treat the data as approximately normally distributed.
- The residuals look normally distributed apart from a single outlier (point 165). Normality of
the residuals is a reasonable assumption once this outlier is set aside.

III- Perform the inference technique


First, we test for a significant interaction, because if the interaction effect is significant the
main effects should not be interpreted on their own. We use the two-way ANOVA test mentioned
in Question 1, with a significance level of 0.05.
Step 1: Identify Hypothesis Test
Hypothesis testing for interaction factor:
- Ho: There is not a significant interaction between the types of ownership and the Province.
- Ha: There is a significant interaction between the types of ownership and the Province.
Step 2: Test statistic
Check assumptions:
(1) Samples are independent, simple random samples of size n
(2) All populations have the same standard deviation
(3) All populations are normally distributed
Test statistic and p-value:
We used RStudio to calculate and obtained the following output:

⮚ Dataset.result <- aov(roa ~ own*province, data = Dataset)

⮚ summary(Dataset.result)

              Df Sum Sq Mean Sq F value Pr(>F)
own            1  0.126 0.12577   5.482 0.0203 *
province       2  0.166 0.08303   3.619 0.0289 *
own:province   2  0.155 0.07763   3.383 0.0362 *
Residuals    174  3.992 0.02294
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Step 3: Level of significance


The level of significance: α=0.05
Step 4: Decision rule
We will reject Ho if p-value ≤ α.
Step 5: Value of test statistic
To test the interaction between the types of ownership and the Province, we got: p-value=0.0362
< α=0.05 => Reject Ho
Step 6: Conclusion
There is enough evidence to conclude that the interaction between the types of ownership and the
Province is significant.
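
Every F value and p-value in the two-way ANOVA table can be recomputed from the reported mean squares and degrees of freedom. This is a minimal sketch that re-enters the rounded table values, so the results match the table only up to rounding:

```r
# Mean squares and degrees of freedom copied from the two-way ANOVA table.
tab <- data.frame(
  term = c("own", "province", "own:province"),
  df   = c(1, 2, 2),
  ms   = c(0.12577, 0.08303, 0.07763)
)
ms_resid <- 0.02294   # residual mean square
df_resid <- 174       # residual degrees of freedom

# Each F value is the term's mean square over the residual mean square.
tab$F_value <- tab$ms / ms_resid

# Upper-tail F probabilities give the p-values.
tab$p_value <- pf(tab$F_value, tab$df, df_resid, lower.tail = FALSE)

print(tab)
```

All three terms fall below the 0.05 threshold, with the interaction term being the one that the decision in Step 5 is based on.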

Question 5: Draw an interaction plot and interpret the plot. Is the plot consistent with the
conclusions made in Question 4?

An interaction plot is a graphical way of determining whether a two-factor interaction appears
to be present. It is created with the interaction.plot() function as follows:

⮚ interaction.plot(Dataset$province, Dataset$own, Dataset$roa, type = "b", col = c("red", "blue"), pch = c(16, 18), main = "Interaction between own and province")
As can be observed from the plot, the two lines are clearly not parallel, which indicates
an interaction between “province” and “own”. In other words, the difference in ROA between the
two types of ownership depends on the province, so the choice of ownership type matters for a
given location. State-owned enterprises show fairly stable profitability across almost all
locations, possibly because of government or local support. In contrast, the ROA of private
businesses varies greatly, possibly because it depends on whether the location offers favorable
conditions such as tax policy, government incentives, culture, etc.
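
What "not parallel" means can be made concrete with the cell means that an interaction plot draws. The sketch below uses synthetic data in which one cell is deliberately lowered (the object toy and all of its values are assumptions, not the report's dataset) and measures how much the gap between the two ownership lines changes across provinces:

```r
set.seed(6)  # synthetic data only

# Hypothetical balanced design: 30 firms per ownership/province cell.
toy <- expand.grid(rep = 1:30,
                   own = c("state-owned", "private-owned"),
                   province = c("1", "2", "3"))

# Invented cell means: private-owned firms drop in province 3 only.
true_mean <- ifelse(toy$own == "private-owned" & toy$province == "3",
                    -0.05, 0.03)
toy$roa <- rnorm(nrow(toy), mean = true_mean, sd = 0.02)

# Cell means: one row per ownership type, one column per province.
cell_mean <- tapply(toy$roa, list(toy$own, toy$province), mean)

# Vertical gap between the two lines in each province.
gap <- cell_mean["state-owned", ] - cell_mean["private-owned", ]

# Parallel lines would give a constant gap, i.e. a range near zero.
gap_range <- diff(range(gap))

print(round(cell_mean, 4))
print(round(gap, 4))
```

A gap that stays roughly constant across provinces corresponds to parallel lines and no interaction, while a gap that changes markedly corresponds to the crossing or diverging lines seen in an interaction plot.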

In general, the ROA of state-owned enterprises tends to be stable regardless of province,
whereas the ROA of privately owned enterprises fluctuates strongly. In province 1, there is a
difference in ROA between the two types of ownership: the ROA of state-owned enterprises is
the highest among the 3 provinces, and the figure for private ownership is not too low. In
province 2, the ROA of private ownership is the highest of the 3 provinces, which suggests that
private businesses there, besides managing profit efficiently, also benefit from favorable
local conditions. This province also has the smallest disparity in ROA between the two types of
ownership, which may indicate that it is the most economically developed of the three areas. In
province 3, the ROA of private ownership falls sharply to its lowest level, which may signal
poor profit management and very few conditions supporting the development of private
enterprises.
From the graph, we can see that the plot is consistent with the conclusion made in Question 4,
namely that the interaction between “own” and “province” is significant.
