EDU 411 Topic 5 Data Analysis
Course Lecturer: Dr. Stephen Kipkorir Rotich; Mobile: +254 724 941 908; e-mail: rotichkip-[email protected]
5. DATA ANALYSIS
Measurement Scales
Measurement scales are used to classify and quantify variables in research. They determine the
nature of the data and the types of statistical analyses that can be applied. There are four primary
types of measurement scales:
1. Nominal Scale
o Definition: The nominal scale classifies data into distinct categories that do not
have any inherent order or ranking.
o Characteristics:
Categories are mutually exclusive and exhaustive.
No numerical value or order is associated with the categories.
o Examples: Gender (male, female), nationality (American, Canadian), color (red,
blue).
o Descriptive Statistics: Frequency counts and mode (the most common
category).
2. Ordinal Scale
o Definition: The ordinal scale ranks data in a meaningful order, but the intervals
between ranks are not necessarily equal.
o Characteristics:
Data can be ordered or ranked.
Differences between ranks are not uniform.
o Examples: Satisfaction ratings (very satisfied, satisfied, neutral, dissatisfied, very
dissatisfied), education level (high school, bachelor’s, master’s, doctoral).
o Descriptive Statistics: Median, mode, and range. Percentiles and quartiles
are also used.
3. Interval Scale
o Definition: The interval scale measures variables where the intervals between
values are equal, but there is no true zero point.
o Characteristics:
Equal intervals represent equal differences in the variable being measured.
Lacks a true zero point (zero does not mean the absence of the variable).
o Examples: Temperature in Celsius or Fahrenheit, IQ scores.
o Descriptive Statistics: Mean, median, mode, standard deviation, and
variance.
4. Ratio Scale
o Definition: The ratio scale has all the properties of the interval scale, with the
added feature of a true zero point.
o Characteristics:
Allows for the comparison of absolute magnitudes.
Zero indicates the absence of the variable.
o Examples: Weight, height, age, income.
o Descriptive Statistics: Mean, median, mode, range, standard deviation,
variance. Ratios and percentages are also meaningful.
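The four scales and their permissible descriptive statistics can be captured in a small lookup table. The sketch below is illustrative only (the dictionary and function names are not from the source): it returns the statistics appropriate to each level of measurement.

```python
# Illustrative mapping from measurement scale to permissible descriptive statistics.
PERMISSIBLE_STATS = {
    "nominal":  ["frequency counts", "mode"],
    "ordinal":  ["frequency counts", "mode", "median", "range", "percentiles"],
    "interval": ["mode", "median", "mean", "standard deviation", "variance"],
    "ratio":    ["mode", "median", "mean", "standard deviation", "variance",
                 "ratios", "percentages"],
}

def stats_for(scale: str) -> list[str]:
    """Return the descriptive statistics appropriate for a measurement scale."""
    return PERMISSIBLE_STATS[scale.lower()]

print(stats_for("nominal"))  # only frequency counts and mode are meaningful
```

Note that each scale inherits the statistics of the weaker scales: a ratio variable supports everything an ordinal variable does, but not the reverse.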
1. Descriptive Statistics
Descriptive statistics summarize and describe the features of a dataset, presenting data in a meaningful way. They provide high-level summaries of a set of information, such as the mean, median, mode, variance, range, and count.
The main purpose of descriptive statistics is to:
provide information about the actual characteristics of a data set
help understand data attributes
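As an illustration, these high-level summaries can be generated with Python's standard statistics module; the exam scores below are hypothetical.

```python
import statistics

# Hypothetical exam scores for illustration (not from the source).
scores = [62, 70, 70, 74, 78, 81, 85, 90]

print("count:   ", len(scores))
print("mean:    ", statistics.mean(scores))      # central tendency
print("median:  ", statistics.median(scores))    # middle value
print("mode:    ", statistics.mode(scores))      # most frequent value
print("range:   ", max(scores) - min(scores))    # dispersion
print("variance:", statistics.variance(scores))  # sample variance
```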
2. Inferential Statistics
Inferential statistics are a broad category of techniques that go beyond describing a data set. They are used to make generalizations or inferences about a population based on a sample: to examine differences among groups, to assess relationships among variables, and to draw conclusions and make predictions. Key concepts in hypothesis testing include:
o Null Hypothesis (H₀): The default assumption that there is no effect or no difference.
o Alternative Hypothesis (H₁): The hypothesis that there is an effect or a difference.
o Test Statistics: Values calculated from the sample data used to determine whether to reject the null hypothesis (e.g., t-test, chi-square test).
o P-Value: The probability of obtaining the observed results, or more extreme results, assuming the null hypothesis is true. A small p-value indicates strong evidence against the null hypothesis.
o Significance Level (α): The threshold for deciding whether to reject the null hypothesis, commonly set at 0.05.
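To make these ideas concrete, the sketch below runs a two-tailed one-sample z-test (chosen here instead of a t-test only because its p-value can be computed with the standard library's normal CDF). The sample values, mu0 and sigma are hypothetical.

```python
import math
import statistics

def one_sample_z_test(sample, mu0, sigma):
    """Two-tailed z-test of H0: population mean equals mu0, with known sigma."""
    n = len(sample)
    z = (statistics.mean(sample) - mu0) / (sigma / math.sqrt(n))
    phi = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))  # standard normal CDF
    p = 2 * (1 - phi)                                  # two-tailed p-value
    return z, p

# Hypothetical data: do these 9 scores differ from a population mean of 100 (sigma = 15)?
sample = [108, 112, 96, 104, 110, 99, 115, 103, 107]
z, p = one_sample_z_test(sample, mu0=100, sigma=15)
alpha = 0.05
print(f"z = {z:.2f}, p = {p:.4f}, reject H0: {p < alpha}")
```

Because the p-value here exceeds α = 0.05, we fail to reject H₀: the observed difference could plausibly be due to chance.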
1. Chi-Square Tests
o Definition: Tests the association between categorical variables.
o Types:
Chi-Square Test of Independence: Determines if there is an association
between two categorical variables.
Chi-Square Goodness of Fit Test: Determines how well observed data fit a particular theoretical distribution, for example whether observed category frequencies match the frequencies expected under a hypothesized distribution.
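The goodness-of-fit statistic itself is simple to compute by hand: sum (O − E)² / E over the categories. A minimal sketch with hypothetical die-roll counts:

```python
# Chi-square goodness of fit: do 100 hypothetical die rolls fit a fair die?
observed = [18, 22, 16, 14, 12, 18]   # counts for faces 1 through 6
expected = [100 / 6] * 6              # a fair die gives equal expected frequencies

# Sum of squared deviations from expectation, scaled by the expected count.
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(f"chi-square = {chi_square:.2f}")
```

The result is compared against the chi-square critical value for (categories − 1) = 5 degrees of freedom, which is 11.07 at α = 0.05; a statistic below that value means the observed counts are consistent with a fair die.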
2. T-test
A t-test is a statistical test used to determine if there is a significant difference between the means
of two groups. It helps in assessing whether the differences observed are likely due to chance or
if they are statistically significant.
Types of t-tests
One-Sample t-test: Compares the mean of a single sample to a known value or population mean. For example, testing if the average height of a sample of students is different from the known average height of the population.
Independent Two-Sample t-test: Compares the means of two independent groups. For example, testing if the average test scores of students from two different teaching methods are different.
Paired Sample t-test: Compares means from the same group at different times or under different conditions. For example, comparing test scores of the same students before and after a training program.
Assumptions of t-tests
Normality: The data in each group should be approximately normally distributed, especially for small sample sizes. For larger samples, the Central Limit Theorem often helps in relaxing this assumption.
Homogeneity of Variance: The variance within each group should be approximately equal. This is more critical for the independent two-sample t-test.
Independence: Observations should be independent of each other.
Hypotheses
Null Hypothesis (H₀): Assumes there is no effect or difference. For example, in an independent t-test, it might state that the means of the two groups are equal.
Alternative Hypothesis (H₁): Assumes there is an effect or difference. In the independent t-test, it might state that the means of the two groups are not equal.
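Under these hypotheses, the independent two-sample t statistic can be computed directly with the standard library. A minimal sketch (the two score lists are hypothetical):

```python
import math
import statistics

def independent_t(group_a, group_b):
    """t statistic for an independent two-sample t-test with pooled variance,
    assuming normality and homogeneity of variance as described above."""
    na, nb = len(group_a), len(group_b)
    ma, mb = statistics.mean(group_a), statistics.mean(group_b)
    # Pooled variance: the two sample variances weighted by their degrees of freedom.
    sp2 = ((na - 1) * statistics.variance(group_a) +
           (nb - 1) * statistics.variance(group_b)) / (na + nb - 2)
    return (ma - mb) / math.sqrt(sp2 * (1 / na + 1 / nb))

# Hypothetical test scores under two teaching methods.
method_1 = [72, 75, 78, 80, 74]
method_2 = [68, 70, 73, 71, 69]
t = independent_t(method_1, method_2)
print(f"t = {t:.2f} with {len(method_1) + len(method_2) - 2} degrees of freedom")
```

The resulting t is then compared against the t distribution with n₁ + n₂ − 2 degrees of freedom (via a table or software) to decide whether to reject H₀.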
In regression analysis, the fundamental purpose is to model the relationship between a dependent variable and one or more independent variables. Here is a breakdown of the equations used in different types of regression models:
For simple linear regression, which models the relationship between a dependent variable (Y) and a single independent variable (X), the equation is:
Y = β0 + β1X + ϵ
where:
β0 is the intercept, β1 is the slope coefficient for X, and ϵ is the error term.
For multiple linear regression, which models the relationship between a dependent variable (Y) and multiple independent variables (X1, X2, …, Xp), the equation extends to:
Y = β0 + β1X1 + β2X2 + … + βpXp + ϵ
where:
each βj is the coefficient of the predictor Xj, β0 is the intercept, and ϵ is the error term.
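The coefficients of the simple model can be estimated by ordinary least squares using only the standard library. A minimal sketch (the hours-studied and score data are hypothetical):

```python
import statistics

def fit_simple_linear(x, y):
    """Least-squares estimates of beta0 (intercept) and beta1 (slope)
    for the model Y = beta0 + beta1*X + error."""
    mx, my = statistics.mean(x), statistics.mean(y)
    # beta1 = sum((x - mx)(y - my)) / sum((x - mx)^2)
    b1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
          / sum((xi - mx) ** 2 for xi in x))
    b0 = my - b1 * mx  # the fitted line passes through the point of means
    return b0, b1

# Hypothetical data: hours studied (X) versus exam score (Y).
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 68]
b0, b1 = fit_simple_linear(hours, score)
print(f"Y = {b0:.1f} + {b1:.1f}X")
```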
Summary
Measurement Scales: Nominal, Ordinal, Interval, and Ratio scales determine the level of
measurement and the types of statistical analysis that can be applied.
Descriptive Statistics: Include measures of central tendency, dispersion, and shape to
summarize and describe data.
Inferential Statistics: Include estimation, hypothesis testing, regression analysis, ANOVA, chi-square tests, and correlation analysis to make inferences and predictions about populations based on sample data.
Understanding these concepts helps researchers design studies, analyze data accurately, and draw
valid conclusions.
Table 1
Descriptive statistics
Inferential statistics: comparing groups with t-tests and ANOVA
Table 2 presents a menu of common, fundamental inferential tests. Remember that even more
complex statistics rely on these as a foundation.
Table 2
Inferential statistics
Statistic                        Intent
t-tests                          Compare groups to examine whether means between two groups are statistically significant.
Analysis of variance (ANOVA)     Compare groups to examine whether means among two or more groups are statistically significant.
Correlation (Pearson/Spearman)   Examine whether there is a relationship or association between two or more variables; provides the degree/strength of the association and whether it is significant.
Regression                       Examine how one or more variables predict another variable; provides the strength/degree of each predictor.
Examining relationships using correlation and regression
The general linear model contains two other major methods of analysis, correlation and regression.
Correlation reveals whether values between two variables tend to systematically change together. Correlation analysis has three general outcomes: (1) the two variables rise and fall together; (2) as values in one variable rise, the other falls; and (3) the two variables do not appear to be systematically related. To make those determinations, we use the correlation coefficient (r) and related p value or CI. First, use the p value or CI, as compared with established significance criteria (eg, p<0.05), to determine whether a relationship is even statistically significant. If it is not, stop, as there is no point in looking at the coefficients. If so, move to the correlation coefficient.
A correlation coefficient provides two very important pieces of information: the strength and direction of the relationship. An r statistic can range from −1.0 to +1.0. Strength is determined by how close the value is to −1.0 or 1.0. Either extreme indicates a perfect relationship, while a value of 0 indicates no relationship. Cohen provides guidance for interpretations: 0.1 is a weak correlation, 0.3 is a medium correlation and 0.5 is a large correlation.1 2 These interpretations must be considered in the context of the study and relative to the literature. The valence (+ or −) of the coefficient reveals the direction of the relationship. A negative correlation means that as one value rises, the other tends to fall, and a positive coefficient means that the values of the two variables tend to rise and fall together.
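The correlation coefficient described above can be computed directly from deviations about the means. A minimal sketch with hypothetical paired data:

```python
import math
import statistics

def pearson_r(x, y):
    """Pearson correlation coefficient, computed from deviations about the means."""
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = math.sqrt(sum((xi - mx) ** 2 for xi in x) *
                    sum((yi - my) ** 2 for yi in y))
    return num / den

# Hypothetical paired measurements; as x rises, y tends to rise.
x = [2, 4, 5, 7, 9]
y = [10, 14, 15, 19, 25]
r = pearson_r(x, y)
print(f"r = {r:.2f}")  # close to +1.0: a strong positive relationship
```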
Regression adds an additional layer beyond correlation that allows predicting one value from another. Assume we are trying to predict a dependent variable (Y) from an independent variable (X). Simple linear regression gives an equation (Y = b0 + b1X) for a line that we can use to predict one value from another. The three major components of that prediction are the constant (ie, the intercept represented by b0), the systematic explanation of variation (b1), and the error, which is a residual value not accounted for in the equation3 but available as part of our regression output. To assess a regression model (ie, model fit), examine key pieces of the regression output: (1) the F statistic and its significance, to determine whether the model systematically accounts for variance in the dependent variable; (2) the r square value, for a measure of how much variance in the dependent variable is accounted for by the model; (3) the significance of coefficients for each independent variable in the model; and (4) residuals, to examine random error in the model. Other factors, such as outliers, are potentially important (see Field4).
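Of the model-fit measures listed, the r square value is straightforward to compute from observed values and model predictions. A minimal sketch (the observed values and predictions below are hypothetical):

```python
import statistics

def r_squared(y, y_hat):
    """Proportion of variance in y accounted for by the model's predictions:
    1 - (residual sum of squares / total sum of squares)."""
    my = statistics.mean(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # unexplained error
    ss_tot = sum((yi - my) ** 2 for yi in y)                  # total variation
    return 1 - ss_res / ss_tot

# Hypothetical observed values and the predictions from a fitted line.
y     = [52, 55, 61, 64, 68]
y_hat = [51.8, 55.9, 60.0, 64.1, 68.2]
print(f"r square = {r_squared(y, y_hat):.3f}")
```

A value near 1 means the predictions track the observations closely; a value near 0 means the model explains little beyond the mean of Y.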
The aforementioned inferential tests are foundational to many other advanced statistics that are beyond the scope of this article. Inferential tests rely on foundational assumptions, including that data are normally distributed, observations are independent, and generally that our dependent or outcome variable is continuous. When data do not meet these assumptions, we turn to non-parametric statistics (see Field4).5
Statistical software
While the aforementioned statistics can be calculated manually, researchers typically use statistical software that processes data, calculates statistics and p values, and supplies a summary output from the analysis. However, the programs still require an informed researcher to run the correct analysis and interpret the output. Several available programs include SAS, Stata, SPSS and R. Try using the programs through a demonstration or trial period before deciding which one to use. It also helps to know or have access to others using the program should you have questions.