
Data Analysis and Statistical Treatment

The document provides an overview of data analysis and statistical treatment, focusing on the types of data (qualitative and quantitative), descriptive and inferential statistics, and various statistical measures. It explains the purpose of descriptive statistics, including measures of central tendency and variability, and introduces inferential statistics for hypothesis testing. Additionally, it discusses tools for data analysis, specifically JASP, and provides examples and practice problems for applying statistical concepts.

Uploaded by

Kurt Doncillo

Data Analysis and

Statistical Treatment
Objectives
›Utilize appropriate statistical tools in
inferential statistics
›Interpret data/statistical results
›Test hypotheses
›Draw conclusions
Data
Data – a collection of facts such as
numbers, words, measurements,
observations, or descriptions of
things.
Data
Qualitative data – describes qualities
or characteristics collected through
questionnaires, interviews, or
observation.
Quantitative data – is the data that
can be counted or measured in
numerical values.
Qualitative data
Nominal data - a type of data that is
used to label variables without providing
any quantitative value.
e.g.
Names – Alexa, Michael
Colors – Orange, Yellow
Texture – Rough, Smooth
Odor – Pleasant, Unpleasant
Qualitative data
Ordinal Data - can be classified into
categories that are ranked in a natural
order.
-Ranking – 1st, 2nd, 3rd
-Socioeconomic status – poor, middle
class, rich
-Likert scales – extremely agree to
extremely disagree
Quantitative data
Interval data - is defined as a data
type that is measured along a scale,
in which each point is placed at equal
distance from one another.
Temperature (Fahrenheit and Celsius)
pH measure
Quantitative data
Ratio data - a form of quantitative
(numeric) data. It measures variables
on a continuous scale, with an equal
distance between adjacent values. A
distinguishing property of ratio data is
that it has a 'true zero’.
Length, Mass, Height, temperature in
Kelvin scale
Exercise no. 1
(Identify what type of data)
1. 10 seconds
2. Citrus x microcarpa
3. 1.5 microgram
4. Female, Male
5. 20 °C
6. Teacher I, Teacher II, Teacher III
7. 35 K
8. pH 5.5
9. 9.8 m/s2
10. Exam score
Statistics
- the science concerned with developing and
studying methods for collecting, analyzing,
interpreting, and presenting empirical data.
Two types of Statistics
1. Descriptive Statistics
2. Inferential Statistics
Descriptive Statistics
-statistics that summarize or describe features
of a data set, such as its central tendency or
dispersion.
-Descriptive statistics are broken down into
measures of central tendency and measures of
variability (spread), measures of frequency
distribution.
Purpose of Descriptive Statistics
-The main purpose of descriptive statistics is
to provide information about a data set.
- Descriptive statistics summarize the large
amount of data into several useful bits of
information.
Can Descriptive Statistics be used to
make predictions or inference?
-No. While these descriptives help understand
data attributes, inferential statistical
techniques—a separate branch of statistics—
are required to understand how variables
interact with one another in a data set.
Descriptive Statistics
-Measures of Central Tendency – describe the
center of the data (Mean, Median, Mode)
-Measure of Variability – describe the
dispersion of the data set (variance, standard
deviation, range)
-Measure of frequency distribution – describe
the occurrence of data within the data set.
Measures of Central Tendency
-Measures of central tendency describe the
center position of a distribution for a data set. A
person analyzes the frequency of each data
point in the distribution and describes it using
the mean, median, or mode, which measures
the most common patterns of the analyzed
data set.
Measures of Central Tendency
-Mean (Average) – the sum of all the values in
the data set divided by the number of
observations in it.
Advantages of mean
› The mean uses every value in the data and hence is a
good representative of the data. Ironically, this value
often does not itself appear in the raw data.

› Repeated samples drawn from the same population
tend to have similar means. The mean is therefore the
measure of central tendency that best resists the
fluctuation between different samples.
Disadvantage of mean
The important disadvantage of mean is that it
is sensitive to extreme values/outliers,
especially when the sample size is small.
Therefore, it is not an appropriate measure of
central tendency for skewed distribution.
Measures of Central Tendency
-Median – the value in the middle of the
ordered data set.

-Mode – the value that occurs most frequently.
Measures of Variability (Dispersion)
-Measures of variability (or the measures of
spread) aid in analyzing how dispersed the
distribution is for a set of data. For example,
while the measures of central tendency may
give a person the average of a data set, it does
not describe how the data is distributed within
the set. To measure variability, researchers can
use variance, standard deviation, and range.
Measures of Variability (Dispersion)
Variance – a measurement of how far each
number in a data set is from the mean
(average), and thus from every other number in
a data set.
Standard deviation (or σ) is a measure of
how dispersed the data is in relation to the
mean. Low standard deviation means data are
clustered around the mean, and high standard
deviation indicates data are more spread out.
Measures of Variability (Dispersion)
Range – is the difference between the highest
and the lowest value in the data set.
Practice Problem #1
› Anna measured the
temperature five times
a day. These are her
measuring results:
35°C, 38°C, 37°C, 34°C,
28°C. Find the average
temperature, the
variance, and the
standard deviation of
the data set.
Practice Problem #1
Data        (x − x̄)    (x − x̄)²
35°C          0.6         0.36
38°C          3.6        12.96
37°C          2.6         6.76
34°C         -0.4         0.16
28°C         -6.4        40.96
x̄ = 34.4°C         Σ(x − x̄)² = 61.2

s² = Σ(x − x̄)² / (n − 1) = 61.2 / (5 − 1) = 15.3

Extract the square root of the variance to get the
standard deviation: s = √15.3 ≈ 3.91
Practice Problem #1
› Anna measured the temperature
five times a day. These are her
measuring results: 35°C, 38°C,
37°C, 34°C, 28°C. Find the
average temperature, the
variance, and the standard
deviation of the data set.
›x̄ = 34.4°C, s² = 15.3, s = 3.91
Practice Problem #2
›Find the variance and
standard deviation for
the following sample: 12,
13, 24, 24, 25, 26, 34,
35, 38, 45, 46, 52, 53,
78, 78, 89.
Practice Problem #3
›Find the population
variance and population
standard deviation for
the following set of
numbers: 28, 29, 30, 31,
32.
Tools in Data Analysis
Introduction to JASP
Tools in Data Analysis

SPSS SAS Jamovi PSPP

Minitab Stata JASP R


JASP
› JASP - stands for Jeffreys's Amazing
Statistics Program, in recognition of the
pioneer of Bayesian inference, Sir Harold
Jeffreys.
› It is a free, multiplatform, open-source
statistics package, developed and
continually updated by a group of
researchers at the University of Amsterdam.
Frequency Distribution Table
› Provides the frequency or the number of
times an observation occurs in your
data.
› Gives a snapshot of your data whether it
is categorical or numeric.
› Commonly used in studies that
involve/profiling describing a certain
group or phenomenon.
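As an illustration with invented survey responses, the same kind of frequency table can be produced in Python with the standard-library Counter:

```python
# Tiny frequency table for categorical data, analogous to JASP's
# Descriptives > Frequency tables. The responses are made up.
from collections import Counter

responses = ["Agree", "Agree", "Neutral", "Disagree", "Agree", "Neutral"]
freq = Counter(responses)

for category, count in freq.most_common():
    print(category, count)
```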
Table 1. Number of respondents per grade level

Junior High School   Total number of enrolled    Proportional sample size
Grade level          students per grade level    (respondents) per grade level
Grade 7                       817                             93
Grade 8                       816                             93
Grade 9                       737                             84
Grade 10                      735                             84
Total                        3105                            354
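Table 1's sample sizes are consistent with simple proportional allocation: each grade's share of the 354 respondents mirrors its share of total enrolment. A sketch, assuming plain rounding was used (the published figures match this exactly):

```python
# Proportional allocation behind Table 1: per-stratum sample size
# n_h = round(N_h / N * n). Assumes simple rounding was used.
enrolled = {"Grade 7": 817, "Grade 8": 816, "Grade 9": 737, "Grade 10": 735}
n_total = 354
N = sum(enrolled.values())  # 3105 enrolled students in total

sample = {g: round(N_h / N * n_total) for g, N_h in enrolled.items()}
print(sample)
print(sum(sample.values()))  # rounding can drift from n_total in general
```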


WORKSHOP TIME
›Practice making
frequency
distribution
tables using
JASP and the
provided .csv file.
Inferential Data Analysis
Inferential Data Analysis
› Commonly referred to as Inferential
statistics.
› Provides complex analyses that show
relationship and/or differences between
multiple variables; involves the use of
hypotheses.
› Helpful to generalize results and make
predictions.
Parametric Tests Nonparametric Tests
Inferential Data Analysis
                      PARAMETRIC TESTS       NON-PARAMETRIC TESTS
POPULATION            Completely known       Not known / unavailable
DATA DISTRIBUTION     Normal probabilistic   Non-normal / arbitrary
MEASUREMENT LEVEL     Interval, Ratio        Nominal, Ordinal
CENTRAL TENDENCY      Mean                   Median
ASSUMPTIONS           Yes                    No
Inferential Data Analysis – Test of difference
                                   PARAMETRIC TESTS          NON-PARAMETRIC TESTS
Comparing one sample to a          z-test / Student's        Wilcoxon signed-rank
known or hypothesized              (one-sample) t-test       test
population mean
Testing for differences            Student's (independent    Mann-Whitney U test
between two independent            samples) t-test
groups
Testing for differences            Paired t-test             Wilcoxon signed-rank
between two related groups                                   test
Testing for differences            One-way ANOVA             Kruskal-Wallis test
between two or more
independent groups
Which test should I use?
TEST OF SIGNIFICANT DIFFERENCE
(One-sample) T-test / Z-test
› Research is normally carried out on samples of
populations, but how closely does the sample
reflect the whole population?
› The parametric one-sample t-test determines
whether the sample mean is statistically
different from a known or hypothesized
population mean.
› The null hypothesis (Ho) tested is that the
sample mean is equal to the population mean.
(One-sample) T-test / Z-test
ASSUMPTIONS
Four assumptions are required for a one-sample t-test
to provide a valid result:
➢ The test variable should be measured on a
continuous scale.
➢The test variable data should be independent i.e. no
relationship between any of the data points.
➢ The data should be approximately normally
distributed.
➢There should be no significant outliers.
(One-sample) T-test / Z-test
Rule-of-thumb
As a rule of thumb, the one-sample t-test is
used if you have fewer than 30 data points
in your sample.
The z-test provides more robust results if
you have 30 or more data points.
Z-test (Sample Problem)
› The ABC tire company claims that the
average lifetime of their tires is at least
28,000 km. To check the claim, a taxi
company puts 40 of these tires on its
taxis and gets a mean lifetime of 25,560
km with a standard deviation of 1,350
km. Is the claim true? Test at 5% level of
significance.
Z-test (Sample Problem)
› The ABC tire company claims that the average lifetime of
their tires is at least 28,000 km. To check the claim, a taxi
company puts 40 of these tires on its taxis and gets a mean
lifetime of 25,560 km with a standard deviation of 1,350 km.
Is the claim true? Test at 5% level of significance.
› Problem: Is the claim true that the average lifetime of a
certain brand of tire is at least 28,000 km?
› Null hypothesis: The average lifetime of a certain tire is at
least 28,000 km (𝜇 ≥ 28,000).
› Alternate hypothesis: The average lifetime of a certain tire
is less than 28,000 km (𝜇 < 28,000).
Z-test (Sample Problem)
› Decision making: Reject H0 if |zcalc| ≥ ztable
› Z-table:

Test          Level of Significance
               0.01        0.05
One-tailed    ±2.33       ±1.645
Two-tailed    ±2.576      ±1.96

› Findings: zcalc = (25,560 − 28,000) / (1,350/√40) ≈ −11.43,
so |zcalc| > ztable = 1.645

› Decision and Interpretation: Reject H0; the average lifetime
of this brand of tire is less than 28,000 km.
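The z-statistic for this problem can be reproduced in a few lines. A sketch using only the standard library, with the one-tailed critical value 1.645 hard-coded from the z-table above:

```python
# z-test for the tire problem: H0: mu >= 28,000 vs H1: mu < 28,000.
import math

mu0, xbar, s, n = 28_000, 25_560, 1_350, 40
z = (xbar - mu0) / (s / math.sqrt(n))  # about -11.43

z_crit = 1.645                 # one-tailed, alpha = 0.05
reject_h0 = z < -z_crit        # left-tailed rejection region
print(round(z, 2), reject_h0)
```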
(One-sample) T-test
Sample Problem
› The hospital claims that its record
shows that the mean weight of newly
born babies is 7 lbs. with a standard
deviation of 0.75 lb. A researcher takes
a sample of 25 newly born babies, which
is found to have a mean weight of 6.73
lb. Test the claim at 5% level of
significance.
(One-sample) T-test
Sample Problem
› The hospital claims that its record shows that the mean
weight of newly born babies is 7 lbs. with a standard
deviation of 0.75 lb. A researcher takes a sample of 25
newly born babies, which is found to have a mean weight of
6.73 lb. Test the claim at 5% level of significance.
› Problem: Is the claim of the hospital that the mean weight
of newly born babies equal to 7 lbs. true?
› Null hypothesis: The mean weight of newly born babies is
equal to 7 lbs. (𝜇 = 7)
› Alternate hypothesis: The mean weight of newly born
babies is not equal to 7 lbs. (𝜇 ≠ 7)
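Here t = (6.73 − 7) / (0.75/√25) = −1.8, which is smaller in magnitude than the two-tailed critical value t(24) ≈ 2.064, so H0 is not rejected. A standard-library sketch, with the critical value hard-coded from a t-table:

```python
# One-sample t-test for the newborn-weight problem.
import math

mu0, xbar, s, n = 7.0, 6.73, 0.75, 25
t = (xbar - mu0) / (s / math.sqrt(n))  # -0.27 / 0.15 = -1.8

t_crit = 2.064                  # df = 24, alpha = 0.05, two-tailed
reject_h0 = abs(t) > t_crit     # False: the claim is not rejected
print(round(t, 2), reject_h0)
```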
Running (one sample) t-test in JASP
➢Open 03 – Height and mass (MCT and
MOV).csv, this contains two columns of data
representing the height (cm) and body
masses (kg) of a sample population of males
used in a study.

➢In 2017 the average adult male in the UK
population was 178 cm tall with a mass of
83.6 kg.
The descriptive data shows that the mean
height of the sample population was 177.6
cm compared to the average 178 cm UK
male.
Decision Making
➢Test for Normality (Shapiro-Wilk test)
p < 0.05 – Not Normally distributed (Non-parametric)
p ≥ 0.05 – Normally distributed (Parametric)
➢Hypothesis Testing
p < 0.05 – there is a significant difference (reject the
null hypothesis)
p ≥ 0.05 – there is no significant difference (fail to
reject the null hypothesis)
Reporting the results
A one-sample t-test showed no
significant difference in height
compared to the population mean
(t(22) = -0.382, p = 0.706).
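Outside JASP, the same test is a single call in SciPy. A sketch with invented height data (not the course's .csv), assuming scipy is installed:

```python
# One-sample t-test with SciPy against the 178 cm population mean.
# The heights below are made up for illustration only.
from scipy import stats

heights = [176.2, 178.9, 175.4, 179.1, 177.0, 176.8, 178.3, 174.9]
t_stat, p_value = stats.ttest_1samp(heights, popmean=178)
print(round(t_stat, 3), round(p_value, 3))
```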
INDEPENDENT SAMPLES T-TEST
comparing two independent groups
› The parametric independent t-test, also
known as Student's t-test, is used to
determine if there is a statistical difference
between the means of two independent
groups.
› The test requires a continuous dependent
variable (e.g. body mass) and an independent
variable comprising two groups (e.g. males
and females).
INDEPENDENT SAMPLES T-TEST
comparing two independent groups
› This test produces a t-score, which is a ratio
of the differences between the two groups and
the differences within the two groups.
› A large t-score indicates that there is a greater
difference between groups. The smaller the t-
score, the more similarity there is between
groups. A t-score of 5 means that the
groups are five times as different from each
other as they are within each other.
INDEPENDENT SAMPLES T-TEST
comparing two independent groups
ASSUMPTIONS
1. Group Independence – Both groups must be
independent from each other.
2. Normality of the dependent variable – the
dependent variable should be measured on a
continuous scale and be approximately
normally distributed with no significant outliers.
This can be checked using the Shapiro-Wilk
test. A rule of thumb is that the ratio between
the group sizes should be <1.5 (i.e. group A =
12 participants and group B ≥ 8 participants).
INDEPENDENT SAMPLES T-TEST
comparing two independent groups

ASSUMPTIONS
3. Homogeneity of variance - The variances of
the dependent variable should be equal in each
group. This can be tested using Levene's Test of
Equality of Variances.
If Levene's Test is statistically significant,
indicating that the group variances are unequal we
can correct for this violation by using an adjusted
t-statistic based on the Welch method.
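In SciPy the Welch correction is a single flag. A sketch with invented weight-loss data for two independent groups:

```python
# Independent-samples t-test; equal_var=False applies the Welch
# correction for unequal variances. Data are invented.
from scipy import stats

females = [5.2, 6.1, 4.8, 5.9, 6.3, 5.5, 4.9, 6.0]  # kg lost
males = [3.1, 2.8, 3.5, 2.9, 3.3, 3.0, 2.7, 3.2]

t_stat, p_value = stats.ttest_ind(females, males, equal_var=False)
print(round(t_stat, 2), p_value < 0.001)
```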
Reporting the results
An independent t-test showed
that females lost significantly
more weight over 10 weeks
dieting than males (t(85) = 6.16,
p < 0.001).
PAIRED SAMPLES T-TEST
comparing two related groups
›The parametric paired samples t-test
(also known as the dependent
samples t-test or repeated measures
t-test).
›It compares the means between two
related groups on the same
continuous dependent variable.
PAIRED SAMPLES T-TEST
comparing two related groups

›For example, you want to find


out if there is a significant
difference in the weight loss
pre (before) and post (after) 10
weeks of dieting.
PAIRED SAMPLES T-TEST
comparing two related groups
›With the paired t-test, the null
hypothesis (H0) is that the pairwise
difference between two groups is
zero.

H0: x̄1 = x̄2
ASSUMPTIONS of the Parametric
Paired-Samples T-test
➢The dependent variable should be measured
on a continuous scale.
➢Independent variable should consist of 2
categorical related/matched groups, i.e.
each participant is matched in each groups.
➢The differences between the matched pairs
should be approximately normally
distributed.
➢There should be no significant outliers.
PAIRED SAMPLES T-TEST
comparing two related groups
›Open 06 – Dieting A.csv
›This contains two columns of paired
data, pre-diet body mass and post 4
weeks of dieting.
›Go to T-tests → Paired samples t-
test → load the variables to analysis
box on the right.
Check the following:
Reporting the results
On average, participants lost
3.78 kg (SE: 0.29 kg) body mass
following a 4-week diet plan. A
paired samples t-test showed this
decrease to be significant
(t(77)=13.04, p<0.001).
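A paired test matches each pre value with its post value. A SciPy sketch with invented pre/post body masses:

```python
# Paired-samples t-test on invented pre/post diet data;
# scipy.stats.ttest_rel tests whether the mean difference is zero.
from scipy import stats

pre = [82.5, 78.1, 91.0, 85.3, 79.8, 88.2]   # kg before dieting
post = [79.0, 74.9, 87.2, 81.5, 76.4, 84.6]  # kg after 4 weeks

t_stat, p_value = stats.ttest_rel(pre, post)
print(round(t_stat, 2), p_value < 0.001)
```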
One way - ANOVA (Analysis of Variance)
difference between three or more groups
›One-way analysis of variance
(ANOVA) compares the means of
three or more groups.
›ANOVA has been described as an
"omnibus test"; it produces an F-
statistic that compares the variation
between the groups with the variation
within the groups.
One-way ANOVA (Analysis of Variance)
comparing three or more independent groups

ASSUMPTIONS
1. Independent variable must be categorical
and dependent variable must be
continuous.
2. The groups should be independent of each
other.
3. The dependent variable should be
approximately normal.
One-way ANOVA (Analysis of Variance)
comparing three or more independent groups
ASSUMPTIONS
4. There should be no significant outliers.
5. There should be homogeneity of variance
between the groups, otherwise the p-value for
the F-statistic may not be reliable.

Note: The first 2 assumptions are usually
controlled through appropriate research
design.

Note: If the last three assumptions are violated,
the non-parametric Kruskal-Wallis test should
be considered instead.
One-way ANOVA (Analysis of Variance)
comparing three or more independent groups

Assumption checks:
Levene’s test - measures equality of variance
(homoscedasticity)
p < 0.05 - unequal variances (Use Welch
corrections / Brown-Forsythe
correction)
p ≥ 0.05 - equal variances (Use None)
One-way ANOVA (Analysis of Variance)
comparing three or more independent groups
Assumption checks:

Q–Q Plot - assesses whether the
distribution of the data is normal.
One-way ANOVA (Analysis of Variance)
comparing three or more independent groups
Hypothesis testing:
p < 0.05 - there is a significant difference
between means of the groups. (Reject the H0
and proceed to post hoc testing to identify
which group is significantly different.)
p ≥ 0.05 - there is no significant
difference between the means of the groups.
(Fail to reject the null, stop the analysis,
and report the findings.)
One-way ANOVA (Analysis of Variance)
comparing three or more independent groups

›If the ANOVA reports no


significant difference you can
go no further in the analysis.
One-way ANOVA (Analysis of Variance)
comparing three or more independent groups
Post-hoc testing
- are tests that were decided upon after
the data have been collected.
- It can only be carried out if the ANOVA F
test is significant.
Reporting the results
Independent one-way ANOVA showed a
significant effect of the type of diet on weight
loss after 10 weeks (F(2, 69) = 46.184,
p < .001).
Post hoc testing using Tukey’s correction
revealed that diet C resulted in significantly
greater weight loss than diet A (p<.001) or diet
B (p=.001). There were no significant
differences in weight loss between diets A and
B (p=.777).
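The omnibus F-test itself is one call in SciPy. A sketch with invented weight-loss data for three diets (a post-hoc test such as Tukey's would follow only if F is significant):

```python
# One-way ANOVA across three invented diet groups;
# scipy.stats.f_oneway returns the omnibus F-statistic and p-value.
from scipy import stats

diet_a = [2.1, 1.8, 2.5, 2.0, 1.9, 2.3]
diet_b = [2.4, 2.2, 2.6, 2.1, 2.5, 2.0]
diet_c = [4.8, 5.1, 4.5, 5.0, 4.7, 4.9]  # clearly larger losses

f_stat, p_value = stats.f_oneway(diet_a, diet_b, diet_c)
print(round(f_stat, 1), p_value < 0.001)
```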
