CAB Project
RECORD WORK
M.COM (GENERAL)
NO:20 Ⅳ LANE, NUNGAMBAKKAM HIGH ROAD,
NUNGAMBAKKAM CHENNAI-600034
OCTOBER 2024
DESCRIPTIVE STATISTICS
Descriptive statistics are measures used to summarise the distribution of the observations
in a variable; they are useful for analysing data, transforming variables, and reporting
results. Each descriptive statistic has its own formula, which is not covered here, but the
interpretation of each is walked through below.
ASSUMPTIONS:
Assumption 1 : The observations are independent of one another.
Assumption 2 : The data are approximately normally distributed.
Assumption 3 : The variance is similar across the groups being compared.
The dataset consists of 10 students' scores in 3 different subjects (Math, Science, and
English).
ID Math Science English
1 85 78 90
2 92 88 94
3 76 85 89
4 81 79 80
5 87 92 85
6 75 82 88
7 90 94 95
8 83 89 91
9 78 81 87
10 88 86 84
PROCEDURE:
OUTPUT:
INTERPRETATION:
The table shows summary statistics for the Math, Science, and English scores of the sample
of 10 students. The average scores for Math, Science, and English are 83.5, 85.4, and 88.3,
respectively, with English having the highest mean. The distributions of the Math and
English scores are slightly negatively skewed, indicating that scores cluster toward the
higher end, while the Science scores are slightly positively skewed. The variance is highest
in Math (34.94), indicating more spread in these scores compared to Science (29.38) and
English (20.90). Additionally, the kurtosis values suggest all three score distributions are
relatively flat (platykurtic).
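The figures reported above can also be reproduced in Python as a cross-check. The following is a minimal sketch, assuming the pandas library is available, using the dataset listed above.

import pandas as pd

# Scores of the 10 students from the dataset above
data = pd.DataFrame({
    "Math":    [85, 92, 76, 81, 87, 75, 90, 83, 78, 88],
    "Science": [78, 88, 85, 79, 92, 82, 94, 89, 81, 86],
    "English": [90, 94, 89, 80, 85, 88, 95, 91, 87, 84],
})

print(data.mean())        # sample means: 83.5, 85.4, 88.3
print(data.var(ddof=1))   # sample variances: 34.94, 29.38, 20.90
print(data.skew())        # skewness (slightly negative for Math and English)
print(data.kurt())        # excess kurtosis (negative values indicate a platykurtic shape)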
CRONBACH’S ALPHA TEST
Cronbach's alpha is the most common measure of internal consistency ("reliability"). It is
most commonly used when you have multiple Likert questions in a survey/questionnaire that
form a scale and you wish to determine if the scale is reliable.
ASSUMPTIONS:
Assumption 1 : The error terms should not be correlated.
The dataset consists of 10 students' scores in 3 different subjects (Math, Science, and
English).
ID Math Science English
1 85 78 90
2 92 88 94
3 76 85 89
4 81 79 80
5 87 92 85
6 75 82 88
7 90 94 95
8 83 89 91
9 78 81 87
10 88 86 84
PROCEDURE:
OUTPUT:
INFERENCE:
From the Reliability Statistics table, we infer that the Cronbach’s Alpha value is 0.721.
Since values between 0.7 and 1.00 are generally regarded as acceptable, the scale formed by
the scores of the 10 students in Math, Science and English can be considered reliable,
although it is only marginally above the 0.7 threshold.
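The reported coefficient can also be checked by applying the definition of Cronbach's alpha, alpha = k / (k - 1) x (1 - sum of item variances / variance of the total score), to the same data. The following is a minimal Python sketch, assuming pandas is available.

import pandas as pd

# Item scores: the three subjects are treated as the items of the scale
items = pd.DataFrame({
    "Math":    [85, 92, 76, 81, 87, 75, 90, 83, 78, 88],
    "Science": [78, 88, 85, 79, 92, 82, 94, 89, 81, 86],
    "English": [90, 94, 89, 80, 85, 88, 95, 91, 87, 84],
})

k = items.shape[1]                              # number of items
item_variances = items.var(ddof=1).sum()        # sum of the item variances
total_variance = items.sum(axis=1).var(ddof=1)  # variance of the total score
alpha = (k / (k - 1)) * (1 - item_variances / total_variance)
print(round(alpha, 3))                          # approximately 0.721, as reported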
ONE-SAMPLE T-TEST
Calculate a One-Sample T-Test for the test scores of 10 students to check whether their
mean score is significantly different from 75.
ID Test Scores
1 70
2 75
3 80
4 72
5 68
6 77
7 79
8 73
9 71
10 76
HYPOTHESIS:
H0 = The mean score of students is not significantly different from 75.
H1 = The mean score of students is significantly different from 75.
PROCEDURE:
OUTPUT:
INTERPRETATION:
The One-Sample Statistics table gives the average score as 74.10 with a standard deviation
of 3.957. The One-Sample Test table gives the t-value, degrees of freedom, significance
level and 95% confidence interval for the mean difference. The t-value of -0.719 with 9
degrees of freedom is not significant, as the significance value is 0.490, which is greater
than 0.05. Therefore, we accept the null hypothesis. Thus, the scores of the students are not
significantly different from the hypothesised average of 75 marks.
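For reference, the same one-sample test can be reproduced in Python; the following is a minimal sketch, assuming the scipy library is available, using the test scores listed above.

from scipy import stats

scores = [70, 75, 80, 72, 68, 77, 79, 73, 71, 76]

# Two-tailed one-sample t-test against the hypothesised mean of 75
t_stat, p_value = stats.ttest_1samp(scores, popmean=75)
print(t_stat, p_value)   # t is about -0.72 and p is about 0.49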
INDEPENDENT SAMPLE T-TEST
The Independent-samples t test procedure compares means for two groups of cases and
automates the t test effect size computation. Ideally, for this test, the subjects should be
randomly assigned to two groups, so that any difference in response is due to the treatment
(or lack of treatment) and not to other factors.
ASSUMPTIONS:
Assumption 1 : There is no relationship between the observations in each group.
Assumption 2 : No significant outliers in the two groups.
Assumption 3 : The data for each group should be approximately normally distributed.
Assumption 4 : The variance of the outcome variable should be equal in each group.
Calculate the Independent Samples T-Test for the test scores of students from two
groups (Group A and Group B).
HYPOTHESIS:
H0 = There is no significant difference in the mean values of the two groups.
H1 = There is significant difference in the mean values of the two groups.
PROCEDURE:
OUTPUT:
INFERENCE:
The Group Statistics table shows that the mean score of Group A is 73 with a standard
deviation of 4.690, and that of Group B is 75.2 with a standard deviation of 3.194. The
Independent Samples Test table gives the t-value, degrees of freedom, significance level
and 95% confidence interval for the mean difference. The t-value of -0.867 with 8 degrees
of freedom is not significant, as the significance value is 0.441, which is greater than 0.05.
Therefore, we accept the null hypothesis, which means there is no significant difference in
the mean scores of students in Group A and Group B.
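The raw scores of the two groups are not reproduced here, but the test can be recomputed from the summary statistics reported above. The following is a minimal Python sketch with scipy, assuming five students per group, which is consistent with the 8 degrees of freedom.

from scipy import stats

# Summary statistics from the Group Statistics table; group sizes of 5 are assumed
t_stat, p_value = stats.ttest_ind_from_stats(
    mean1=73.0, std1=4.690, nobs1=5,   # Group A
    mean2=75.2, std2=3.194, nobs2=5,   # Group B
    equal_var=True,                    # pooled-variance (Student's) t-test
)
print(t_stat, p_value)   # t is about -0.87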
PAIRED SAMPLE T-TEST
Calculate the Paired Sample T-Test between the pre-test and post-test scores of the
same students.
ID Pre-Test Post-Test
1 70 78
2 75 80
3 80 85
4 72 76
5 68 75
HYPOTHESIS:
H0 = There is no significant difference in the mean values of pre-test and post-test scores
H1 = There is significant difference in the mean values of pre-test and post-test scores.
PROCEDURE:
OUTPUT:
INFERENCE:
The Paired Samples Statistics table gives the mean pre-test score as 73 with a standard
deviation of 4.69 and the mean post-test score as 78.8 with a standard deviation of 3.96.
The Paired Samples Test table shows that the t-value of -7.893 with 4 degrees of freedom is
highly significant, with a p-value of 0.001. Therefore, we reject the null hypothesis. Hence,
there is a significant difference between the mean pre-test and post-test scores of the
students.
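A minimal Python sketch of the same paired comparison, assuming scipy is available and using the pre-test and post-test scores from the table above:

from scipy import stats

pre  = [70, 75, 80, 72, 68]
post = [78, 80, 85, 76, 75]

# Two-tailed paired (dependent) samples t-test
t_stat, p_value = stats.ttest_rel(pre, post)
print(t_stat, p_value)   # t is about -7.89 and p is roughly 0.001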
ONE WAY ANOVA
The One-way ANOVA procedure produces a one-way analysis of variance for a quantitative
dependent variable by a single factor (independent) variable and estimates the effect size in
one-way ANOVA. Analysis of variance is used to test the hypothesis that several means are
equal. This technique is an extension of the two-sample t test.
ASSUMPTIONS:
Assumption 1 : The data are independent.
Assumption 2 : These distributions have the same variance.
Assumption 3 : The responses for each factor level have a normal population distribution.
Calculate a One-way ANOVA for the test scores of students from three schools.
HYPOTHESIS:
H0 = There is no significant difference in the mean values of test scores with regard to
schools.
H1 = There is significant difference in the mean values of test scores with regard to schools.
PROCEDURE:
OUTPUT:
INTERPRETATION:
The Descriptives table shows the mean score of school A as 72.5, school B as 82.5 and
school C as 70.5, each with a standard deviation of 3.536. The ANOVA table shows that the
comparison between the schools in terms of marks scored is not statistically significant,
with a significance value of 0.079, which is greater than 0.05. Therefore, we accept the null
hypothesis. Thus, there is no significant difference in the mean test scores with regard to
the schools of the students.
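The raw scores of the three schools are not reproduced here, but the procedure itself can be mirrored in Python. The following is a minimal sketch with scipy; the three score lists are hypothetical placeholders standing in for the schools' data, not the original values.

from scipy import stats

# Hypothetical placeholder scores for the three schools
school_a = [70, 75, 72, 74]
school_b = [80, 85, 81, 84]
school_c = [68, 73, 70, 71]

# One-way ANOVA: tests whether the three group means are equal
f_stat, p_value = stats.f_oneway(school_a, school_b, school_c)
print(f_stat, p_value)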
TWO-WAY ANOVA
A two-way ANOVA is used to estimate how the mean of a quantitative variable changes
according to the levels of two categorical variables. Use a two-way ANOVA when you want
to know how two independent variables, in combination, affect a dependent variable.
ASSUMPTIONS:
Assumption 1 : Observations should be independent of one another.
Assumption 2 : The dependent variable should be normally distributed within each group.
Assumption 3 : The effects of the independent variables are additive.
Assumption 4 : The independent variables should be categorical with fixed levels.
HYPOTHESIS:
H0 = There is no significant interaction effect between gender and school on test scores.
H1 = There is a significant interaction effect between gender and school on test scores.
PROCEDURE:
INTERPRETATION:
The two-way ANOVA results indicate that the overall model explains the variability in test
scores, with a corrected model sum of squares of 202.833. The descriptive statistics show
that the mean test scores for females are 75.00, 85.00, and 73.00 across schools A, B, and C,
respectively, leading to an overall female mean of 77.67. For males, the mean scores are
70.00, 80.00, and 68.00, resulting in an overall male mean of 72.67. However, the output
does not provide F-values or significance (p-values) for gender, school, or their interaction,
which prevents us from drawing conclusions about the statistical significance of these
factors. Notably, the interaction effect between gender and school showed a sum of squares
of 0.000, suggesting that no interaction effect is present in the data.
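The equivalent factorial model can be fitted in Python with statsmodels. The following is a minimal sketch; the data frame is a hypothetical illustration of the gender-by-school layout with two scores per cell, not the original data.

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Hypothetical gender x school layout (two observations per cell)
df = pd.DataFrame({
    "gender": ["F", "F", "F", "F", "F", "F", "M", "M", "M", "M", "M", "M"],
    "school": ["A", "A", "B", "B", "C", "C", "A", "A", "B", "B", "C", "C"],
    "score":  [74, 76, 84, 86, 72, 74, 69, 71, 79, 81, 67, 69],
})

# Two-way ANOVA with main effects of gender and school and their interaction
model = ols("score ~ C(gender) * C(school)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))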
CORRELATION
Correlation coefficients provide a numerical summary of the direction and strength of the
linear relationship between two variables. The two main correlation coefficients are Pearson’s
and Spearman’s. The sign of the correlation coefficient indicates the direction of the
correlation: a positive correlation indicates that as one variable increases, so does the other; a
negative correlation indicates that as one variable increases, the other decreases. The strength
of the relationship is given by the numeric value: 1 indicates a perfect relationship; 0
indicates no relationship between the variables.
ASSUMPTIONS:
Assumption 1 : The two variables of interest should be measured on a continuous scale.
Assumption 2 : The two variables of interest should have a linear relationship, which you can
check with a scatterplot.
Assumption 3 : There should be no spurious outliers.
Assumption 4 : The variables should be normally or near-to-normally distributed.
HYPOTHESIS:
H0 = There is no significant correlation between study hours and test scores.
H1 = There is significant correlation between study hours and test scores.
PROCEDURE:
OUTPUT:
INFERENCE:
According to Pearson’s correlation coefficient, study hours and test scores are positively
and almost perfectly correlated, with a coefficient of 0.982. According to Spearman’s
correlation coefficient, study hours and test scores are likewise positively and almost
perfectly correlated, with a coefficient of 0.975. Both correlations are highly significant,
with p-values of 0.003 for Pearson and 0.005 for Spearman. Therefore, we reject the null
hypothesis and conclude that there is a very strong positive relationship between study
hours and marks scored.
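Both coefficients can be computed in Python with scipy. The following is a minimal sketch; the hours and scores lists are hypothetical placeholders, since the original paired data is not reproduced here.

from scipy import stats

# Hypothetical placeholder data for study hours and test scores
hours  = [2, 3, 4, 5, 6]
scores = [65, 70, 78, 85, 90]

pearson_r, pearson_p = stats.pearsonr(hours, scores)      # linear correlation
spearman_r, spearman_p = stats.spearmanr(hours, scores)   # rank correlation
print(pearson_r, pearson_p)
print(spearman_r, spearman_p)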
REGRESSION
Linear regression is the next step up after correlation. It is used when we want to predict the
value of a variable based on the value of another variable. The variable we want to predict is
called the dependent variable (or sometimes, the outcome variable). The variable we are
using to predict the other variable's value is called the independent variable (or sometimes,
the predictor variable).
ASSUMPTIONS:
Assumption 1 : The relationship between the IVs and the DV is linear.
Assumption 2 : There is no multicollinearity in your data.
Assumption 3 : The values of the residuals are independent.
Assumption 4 : The variance of the residuals is constant.
HYPOTHESIS:
H0 = There is no significant relationship between study hours and test scores.
H1 = There is significant relationship between study hours and test scores.
PROCEDURE:
OUTPUT:
INFERENCE:
From the Model Summary table, the R value of 0.982 corresponds to an R Square of about
0.964, meaning that roughly 96% of the variation in test scores is explained by study hours.
From the Coefficients table, the t-value of 8.927 with a significance value of 0.003 shows
that the coefficient of study hours is highly significant; therefore, we reject the null
hypothesis. Hence, there is a significant relationship between study hours and test scores.
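A minimal Python sketch of the corresponding simple linear regression with scipy, again using hypothetical placeholder data for hours and scores:

from scipy import stats

# Hypothetical placeholder data, in the same form as the study-hours example
hours  = [2, 3, 4, 5, 6]
scores = [65, 70, 78, 85, 90]

# Ordinary least squares fit of scores on hours
result = stats.linregress(hours, scores)
print(result.slope, result.intercept)   # regression coefficients
print(result.rvalue ** 2)               # R squared
print(result.pvalue)                    # significance of the slope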
CHI-SQUARE TEST
The chi-square test is a nonparametric test for categorical data. The chi-square
goodness-of-fit test determines whether the distribution of cases in a single categorical
variable follows a known or hypothesised distribution, while the chi-square test of
independence, which is applied here, determines whether two categorical variables are
associated with each other. The expected proportion of cases in each category can be equal
or unequal.
ASSUMPTIONS:
Assumption 1 : The data in the cells should be frequencies, or counts of cases rather than
percentages or some other transformation of the data.
Assumption 2 : The levels (or categories) of the variables are mutually exclusive.
Assumption 3 : The study groups must be independent.
Test the association or independence between two categorical variables, gender and
preference for two brands (Brand A and Brand B).
HYPOTHESIS:
H0 = There is no significant association between gender and brand preference.
H1 = There is a significant association between gender and brand preference.
PROCEDURE:
OUTPUT:
INFERENCE:
From the above table, we infer that the Chi-square value for the association between
Gender and Brand Preference is 0.667. With 1 degree of freedom, the significance value is
0.414, which is greater than 0.05, so the result is not significant. Thus, we accept the null
hypothesis. Hence, there is no significant association between gender and brand preference
among the respondents.
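A minimal Python sketch of the test of independence with scipy; the 2 x 2 table of counts below is a hypothetical illustration of the gender-by-brand layout, not the original frequencies.

from scipy import stats

# Hypothetical 2 x 2 contingency table: rows = gender, columns = brand preference
observed = [[12, 8],    # Male:   Brand A, Brand B
            [9, 11]]    # Female: Brand A, Brand B

chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print(chi2, p_value, dof)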
MANN-WHITNEY U TEST
The Mann-Whitney U test procedure uses the rank of each case to test whether the groups are
drawn from the same population. It tests whether the two sampled populations are
equivalent in location. The observations from both groups are combined and ranked, with the
average rank assigned in the case of ties. The number of ties should be small relative to the
total number of observations. If the populations are identical in location, the ranks should be
randomly mixed between the two samples.
ASSUMPTIONS:
Assumption 1 : The dependent variable should be measured at the ordinal or continuous
level.
Assumption 2 : The independent variable should have two categorical, independent groups.
Assumption 3 : There should be independence of observations.
Assumption 4 : Test can be used when two variables are not normally distributed.
Test the scores of two groups of students (Group A and Group B).
HYPOTHESIS:
H0 = The distribution of test scores for the two groups is equal.
H1 = The distribution of test scores for the two groups is not equal.
PROCEDURE:
OUTPUT:
INTERPRETATION:
The Mann-Whitney U test results indicate no significant difference in test scores between
Group A and Group B. The Mann-Whitney U statistic is 1.000 and the asymptotic
significance (p-value) is 0.127, which is greater than the conventional alpha level of 0.05,
leading us to fail to reject the null hypothesis. These results suggest that the distributions of
test scores for the two groups are similar, and there is insufficient evidence to conclude that
one group significantly outperforms the other.
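A minimal Python sketch of the same procedure with scipy; the two score lists are hypothetical placeholders for Group A and Group B.

from scipy import stats

# Hypothetical placeholder scores for the two independent groups
group_a = [65, 70, 72, 68, 74]
group_b = [75, 80, 78, 82, 77]

# Two-sided Mann-Whitney U test based on the ranks of the combined scores
u_stat, p_value = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
print(u_stat, p_value)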
WILCOXON MATCHED PAIRS TEST
The Wilcoxon matched pairs (signed-rank) test is the nonparametric equivalent of the dependent t-test. As
the Wilcoxon signed-rank test does not assume normality in the data, it can be used when this
assumption has been violated and the use of the dependent t-test is inappropriate. It is used to
compare two sets of scores that come from the same participants. This can occur when we
wish to investigate any change in scores from one time point to another, or when individuals
are subjected to more than one condition.
ASSUMPTIONS:
Assumption 1 : The dependent variable is continuous or ordinal data.
Assumption 2 : The independent variable is related and matched pairs.
Assumption 3 : The distribution of the differences between the two related groups needs to be
symmetrical in shape.
Assumption 4 : Two samples are not normally distributed, and samples include outliers or
heavy tails.
Calculate the Wilcoxon matched pairs test to compare the pre-test and post-test scores
of the same group of students.
ID Pre-Test Post-Test
1 80 85
2 75 80
3 90 90
4 70 75
5 85 88
HYPOTHESIS:
H0 = There is no significant difference between the pre-test and post-test scores.
H1 = There is a significant difference between the pre-test and post-test scores.
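A minimal Python sketch of the matched pairs test on the scores tabulated above, assuming scipy is available; like the standard procedure, scipy's default setting drops the tied pair with a zero difference (student 3).

from scipy import stats

pre  = [80, 75, 90, 70, 85]
post = [85, 80, 90, 75, 88]

# Wilcoxon signed-rank (matched pairs) test on the paired differences;
# the default zero_method="wilcox" discards pairs whose difference is zero
w_stat, p_value = stats.wilcoxon(pre, post)
print(w_stat, p_value)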
FRIEDMAN TEST
The Friedman test is the non-parametric alternative to the one-way ANOVA with repeated
measures. It is used to test for differences between groups when the dependent variable being
measured is ordinal. It can also be used for continuous data that has violated the assumptions
necessary to run the one-way ANOVA with repeated measures.
ASSUMPTIONS:
Assumption 1: One group that is measured on three or more different occasions.
Assumption 2: Group is a random sample from the population.
Assumption 3: The dependent variable should be measured at the ordinal or continuous level.
Assumption 4: Samples do not need to be normally distributed.
Test the scores of students measured over three different conditions (Condition 1,
Condition 2, Condition 3).
PROCEDURE:
OUTPUT:
INFERENCE:
The Friedman Test results indicate that there is a statistically significant difference in the
scores across the three studying conditions. The null hypothesis (H0), which states that the
distributions of the ranks of the groups are the same, is rejected in favour of the alternative
hypothesis (H1), which suggests that at least one of the conditions differs from the others.
The mean ranks show that COND3 has the highest mean rank (3), followed by COND2 (2),
while COND1 has the lowest mean rank (1). With an asymptotic significance (p-value) of
.007, which is below the conventional alpha level of 0.05, we can conclude that there is a
significant difference in the scores of students across the conditions. Therefore, at least one
of the conditions appears to be more effective than the others in terms of the students'
scores.
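A minimal Python sketch of the same test with scipy; the three condition columns are hypothetical placeholders for the repeated measurements, not the original scores.

from scipy import stats

# Hypothetical placeholder scores of the same students under three conditions
cond1 = [60, 65, 62, 58, 64]
cond2 = [70, 72, 68, 66, 71]
cond3 = [80, 82, 79, 77, 81]

# Friedman test: compares the within-student ranks across the three conditions
chi2_stat, p_value = stats.friedmanchisquare(cond1, cond2, cond3)
print(chi2_stat, p_value)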
WILCOXON SIGNED-RANK TEST
The Wilcoxon signed-rank test is the nonparametric test equivalent to the dependent t-test. As
the Wilcoxon signed-rank test does not assume normality in the data, it can be used when this
assumption has been violated and the use of the dependent t-test is inappropriate. It is used to
compare two sets of scores that come from the same participants. This can occur when we
wish to investigate any change in scores from one time point to another, or when individuals
are subjected to more than one condition.
ASSUMPTIONS:
Assumption 1 : The dependent variable is continuous or ordinal data.
Assumption 2 : The independent variable is related and matched pairs.
Assumption 3 : The distribution of the differences between the two related groups needs to be
symmetrical in shape.
Assumption 4 : Two samples are not normally distributed, and samples include outliers or
heavy tails.
HYPOTHESIS:
H0 = There is no difference in the satisfaction scores before and after training.
H1 = There is a difference in the satisfaction scores before and after training.
PROCEDURE:
OUTPUT:
INFERENCE:
From the Test Statistics table, we infer the Z value to be -2.236. The Ranks table shows
that positive ranks occur where the after-training satisfaction scores are greater than the
before-training scores. Since the significance value of 0.025 is less than 0.05, we reject the
null hypothesis. Thus, there is a significant difference in the satisfaction scores before and
after training.
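A minimal Python sketch of the signed-rank test with scipy; the before and after satisfaction scores are hypothetical placeholders, as the original data is not reproduced here.

from scipy import stats

# Hypothetical placeholder satisfaction scores before and after training
before = [62, 68, 70, 65, 72, 60, 67, 64]
after  = [70, 72, 75, 71, 78, 66, 73, 70]

# Two-sided Wilcoxon signed-rank test on the paired differences
w_stat, p_value = stats.wilcoxon(before, after)
print(w_stat, p_value)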