Research Statistics
Art Walden A. Dejoras
Chapter 2
Descriptive Statistics
Recall: Descriptive Statistics
Descriptive statistics deals with the collection,
organization, summarization and presentation of
data.
Descriptive statistics includes the following:
•Frequency Distribution
•Measures of Central Tendency
•Measures of Variation
•Measures of Position
•Normal Distribution (Measures of Skewness & Kurtosis)
Frequency Distribution
Distribution is a list of scores taken on some particular
variable.
Example: The following is a distribution of 10 students'
scores on a science test:
69, 77, 77, 77, 84, 85, 85, 87, 92, 98
Frequency (f) of an observation is the number of times
that observation occurs in the data set.
Frequency Distribution is the pattern of frequencies of
observations or listing of case counts by category. It can show
either the actual number of observations falling in each range
or the percentage of observations (the proportion per
category). It can be shown using frequency tables or
graphs.
Frequency Table is a chart presenting statistical data that
categorizes the values along with the frequency of each value.
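As an illustration, the frequency table for the 10 science scores listed earlier can be tallied with a short Python sketch. The variable names are chosen for this example only, and the percentage column is simply each frequency divided by the number of scores.

```python
from collections import Counter

# The 10 science test scores from the distribution example.
scores = [69, 77, 77, 77, 84, 85, 85, 87, 92, 98]

# Count how many times each score occurs (its frequency f).
frequency = Counter(scores)
n = len(scores)

# One row per distinct score: (score, frequency, percentage of all scores).
table = [(score, f, 100 * f / n) for score, f in sorted(frequency.items())]

print(f"{'Score':>5} {'f':>3} {'%':>6}")
for score, f, pct in table:
    print(f"{score:>5} {f:>3} {pct:>6.1f}")
```

The score 77 occurs three times, so its frequency is 3 and it accounts for 30% of the observations.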
Example 1:
Example 2:
Example 3:
Example 4: Grouped Frequency Distribution Table
Common Graphical Representations
of Frequency Distributions
• Bar Graphs
• Histogram
• Pie Chart
• Line Graph
Bar Graph
A bar graph is a chart with rectangular bars. The length
of each bar is proportional to the value it represents.
A bar graph is used for nominal and ordinal data to
show the frequency of each category.
Bar Graph for Example 2
Vertical Bar Graph
Horizontal Bar Graph
Group Vertical Bar Graph
Stacked Bar Graph
Histogram
A histogram is a graph that consists of a series of
columns; each column represents an interval (class) of
values of a variable. The frequency of occurrence is
represented by the column's height.
A histogram is useful for graphically displaying interval
and ratio data.
Example
Example
Creating Histogram using SPSS
Step 1: Open SPSS.
Step 2: Click on the circle corresponding
to “Type in data”.
Step 3: Enter your data in one column.
Step 4: Click on “Variable View” tab.
Step 5: Type in name for the variable.
Step 6: In the measure column, pick “Scale”.
Step 7: Click Graphs > Legacy Dialogs > Histogram...
Step 8: Move your variable to “Variable:” box.
Step 9: Click “OK”.
Creating Histogram using SPSS (Output)
Pie Chart
A pie chart is a circular chart that provides a visual
representation of the data (100% = 360 degrees). The
pie is divided into sections, each of which corresponds to a
category of the variable (e.g., age, height). The size
of the section is proportional to the percentage of the
corresponding category.
Pie charts are especially useful for summarizing nominal
data.
Example
Example
Line Graph
A line graph represents data using points connected by
lines in order to show the trend of changes in the data
over a given period of time (independent variable) at
equally spaced intervals (hourly, daily, weekly, monthly,
yearly, etc.).
Data (from the dependent variable) such as changes in
temperature and population can be represented by a line
graph.
Example
Example
Example
Boxplot
A boxplot is a standardized way of displaying the
distribution of data based on a five number summary
(“minimum”, first quartile (Q1), median, third quartile
(Q3), and “maximum”). It can tell you about your
outliers and what their values are. It can also tell you if
your data is symmetrical, how tightly your data is
grouped, and if and how your data is skewed.
Five-Number Summary
1. First Quartile (Q1 or 25th Percentile): the middle number between
the smallest number and the median of the dataset.
2. Median (Q2 or 50th Percentile): the middle value of the dataset.
3. Third Quartile (Q3 or 75th Percentile): the middle value between
the median and the highest value of the dataset.
Formula for the ith Quartile: Qi = the [i(n + 1)/4]th observation,
where the n observations are arranged from lowest to
highest.
If the position i(n + 1)/4 is not an integer, the quartile is found by
interpolation between the two adjacent observations.
Interquartile Range (IQR): the distance from the 25th to the 75th percentile.
Formula: IQR = Q3 - Q1
4. Maximum: the highest observation in the data set after
the outliers are removed.
5. Minimum: the lowest observation in the data set after the
outliers are removed.
Outliers are extreme observations in the data set.
If an observation is greater than Q3 + 1.5(IQR) or
less than Q1 - 1.5(IQR), then it is considered as an
outlier.
Example:
Give the 5-number summary for the following data set. Then
draw a boxplot illustrating the 5-number summary. Determine
if there exists an outlier in the given data set.
{1, 2, 2, 2, 3, 4, 4, 5, 5, 5, 6, 6, 6, 6, 6, 8, 8, 9, 10, 27}.
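The example can be checked with a short Python sketch. It assumes quartile positions are taken at i(n + 1)/4 with linear interpolation, which is what the standard library's `statistics.quantiles` does with `method="exclusive"`; other software conventions may give slightly different quartiles. The variable names are illustrative only.

```python
import statistics

data = [1, 2, 2, 2, 3, 4, 4, 5, 5, 5, 6, 6, 6, 6, 6, 8, 8, 9, 10, 27]

# Quartiles by the (n + 1) position rule with linear interpolation.
q1, median, q3 = statistics.quantiles(data, n=4, method="exclusive")
iqr = q3 - q1

# Fences: anything beyond 1.5 * IQR from the box is treated as an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]
inliers = [x for x in data if lower_fence <= x <= upper_fence]

# Minimum and maximum of the boxplot are taken after removing outliers.
minimum = min(inliers)
maximum = max(inliers)

print("Min:", minimum, "Q1:", q1, "Median:", median, "Q3:", q3, "Max:", maximum)
print("IQR:", iqr, "Fences:", lower_fence, upper_fence, "Outliers:", outliers)
```

Under this convention Q1 = 3.25, the median is 5.5, Q3 = 7.5 and IQR = 4.25, so the fences are -3.125 and 13.875. The only observation beyond them is 27, which leaves a minimum of 1 and a maximum of 10.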
Creating Boxplot using SPSS
Given: {1, 2, 2, 2, 3, 4, 4, 5, 5, 5, 6, 6, 6, 6, 6, 8, 8, 9, 10, 27}.
Step 1: Open SPSS.
Step 2: Click on the circle corresponding to “Type in data”.
Step 3: Click on “Variable View” tab.
Step 4: Type in name for the variable corresponding to scores.
Step 5: In the measure column, pick “Scale”.
Step 6: Type in name for the variable corresponding to group.
Step 7: In the measure column, pick “Nominal”.
Step 8: Click on “Data View” tab.
Step 9: Enter the data in the column corresponding to each variable.
Step 10: Click Graphs > Legacy Dialogs > Boxplot... > Simple > Define
Step 11: Move your score variable to “Variable:” box.
Step 12: Move your group variable to “Category Axis:” box.
Step 13: Click “OK”.
Creating Boxplot using SPSS (OUTPUT)
Given: {1, 2, 2, 2, 3, 4, 4, 5, 5, 5, 6, 6, 6, 6, 6, 8, 8, 9, 10, 27}.
Question:
Consider the results of the post-test from students for the year 2000
and the year 2010 being illustrated in boxplots. What do these results
tell us about how students performed on the 29-question post-test for
the two years?
Answer:
If we compare only the lowest and highest
scores between the two years, we might
conclude that the students in 2010 did better
than the students in 2000. This conclusion
seems to follow since the lowest score of 8 in
2010 is greater in value than the lowest score of
6 in 2000. Also, the highest score of 28 in 2010
is greater in value than the highest score of 27
in 2000.
But the box portion of the illustration gives us
more detailed information. The middle bar in
each box shows us that the median score of 20
in 2000 is greater in value than the median
score of 17 in 2010. Further, we note that the
box and whiskers divide the illustration into four
pieces. Each of these four pieces represents
the same portion of students. So, the upper half
of the students in 2000 scored in the same
score range as the upper one-fourth of the
students in 2010, see the illustration at a score
of 20.
By considering the upper one-fourth,
upper half, and upper three-fourths
instead of just the lowest and highest
scores, we would conclude that the
students as a whole did much better in
2000 than in 2010. We would conclude
that as a whole the students in 2010 are
less prepared than the students in 2000.
Consider this...
What is the representative height
of this group of students?
If you were to join either of these two teams, which team would you
choose? Why?
Measures of Central Tendency
Measure of Central Tendency (Measure of
Average) is a single value used to represent or
summarize the entire data set. It describes where the
data are centered.
Three Measures of Central Tendency.
•Mean
•Median
•Mode
Mean
Median
Mode
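As a small illustration, the three measures can be computed for the 10 science scores from the frequency distribution example. This is a sketch using Python's standard `statistics` module; the variable names are chosen for this example only.

```python
import statistics

scores = [69, 77, 77, 77, 84, 85, 85, 87, 92, 98]

# Mean: sum of the scores divided by the number of scores.
mean = statistics.mean(scores)

# Median: middle value of the sorted scores; with an even count,
# the average of the two middle values.
median = statistics.median(scores)

# Mode: the most frequently occurring score.
mode = statistics.mode(scores)

print("Mean:", mean, "Median:", median, "Mode:", mode)
```

For these scores the mean is 83.1, the median is 84.5 (the average of 84 and 85), and the mode is 77.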
Levels of Measurement and the
Applicable Measure of Central Tendency
Levels of Measurement and the Best
Measure of Central Tendency
Advantages & Disadvantages of the
Measures of Central Tendency
Question:
Consider the scores of 14 students from each of the two
classes in the 50-point mathematics ability test.
Which class is a better
performer in terms of
mathematics ability?
Why?
Introduction to Measures of Variability
Below are two sets of employee performance scores taken from two
work divisions in a company. Assume that the two sets of data are
populations.
The two divisions obtained the same mean score of 77.75 points in the
test. Can we say that the quality of performance in the two divisions
is identical?
very poor score
very high score
The scores of Division A are more widely dispersed compared to
Division B
The scores of Division B are more closely located about the mean
of 77.75 indicating a more consistent or homogeneous set of
workers in terms of performance.
Histograms
Boxplots
Measure of central tendency
presents only half of the picture
or the description of the data.
Measure of variability completes
the description of the data.
Measure of Variability
Measure of Variability is a single number that tells how
varied the observations are in a distribution. It is also called a
measure of variation, dispersion, or spread.
Three Common Measures of Variation are Range, Variance &
Standard Deviation.
The closer the measure of variation is to zero, the less
varied (more homogeneous) the observations are.
The farther the measure of variation is from zero, the more
varied (more heterogeneous) the observations are.
FORMULAS
Range = Highest Value - Lowest Value
Example: Determine the range, standard deviation and variance
of the 10 students' scores on Science test.
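The example can be worked through with a short Python sketch. Whether the 10 scores are meant as a population (divide by N) or a sample (divide by n - 1) is not stated here, so both versions are computed; treating them either way is an assumption of this sketch.

```python
import statistics

scores = [69, 77, 77, 77, 84, 85, 85, 87, 92, 98]

# Range = Highest Value - Lowest Value
value_range = max(scores) - min(scores)

# Population versions: squared deviations from the mean divided by N.
pop_variance = statistics.pvariance(scores)
pop_sd = statistics.pstdev(scores)

# Sample versions: squared deviations from the mean divided by n - 1.
sample_variance = statistics.variance(scores)
sample_sd = statistics.stdev(scores)

print("Range:", value_range)
print("Population variance:", round(pop_variance, 2), "SD:", round(pop_sd, 2))
print("Sample variance:", round(sample_variance, 2), "SD:", round(sample_sd, 2))
```

The range is 29. The population variance is 63.49 (standard deviation about 7.97), while the sample variance is about 70.54 (standard deviation about 8.40).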
Interpretation of Quantitative Data using the
Mean and Standard Deviation
⌘ If we want to describe a data set, it may sometimes be useful to
present a frequency distribution table or graphics, but these are
usually more information than is needed. Two items are often
sufficient:
※ Mean - A measure that tells us what a typical member of
the data set is like.
※ Standard Deviation - A measure that tells us how
spread out the other members of the data set are around the
mean.
Consider this...
In this distribution, only a few fish have extreme lengths and
many have average length... so NORMAL.
This distribution looks like a NORMAL DISTRIBUTION.
The Normal Distribution
Normal distribution is a frequency distribution that
follows a normal curve (or bell-shaped curve).
Properties:
Empirical Rule
For normally distributed data:
• Approximately 68% of the data values will be within
1 standard deviation of the mean.
• Approximately 95% of the data values will be within
2 standard deviations of the mean.
• Almost all (about 99.7%) of the data values will fall within
3 standard deviations of the mean.
Data falling beyond 3 standard deviations from the mean are OUTLIERS.
Example: Suppose the scores of students in a 35-Point Biology
achievement test are normally distributed with mean of 21
and standard deviation of 3.2. Illustrate the associated
normal curve.
About 68% of the students got a score between 17.8 and 24.2 points.
About 95% of the students got a score between 14.6 and 27.4 points.
Almost all students got a score between 11.4 and 30.6
points.
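The three ranges in this example are the mean plus or minus 1, 2 and 3 standard deviations. A minimal Python sketch of that arithmetic follows; the function name `empirical_rule_intervals` is made up for this illustration.

```python
def empirical_rule_intervals(mean, sd):
    """Return the ranges within 1, 2 and 3 standard deviations of the mean."""
    return {k: (mean - k * sd, mean + k * sd) for k in (1, 2, 3)}


# Biology achievement test: mean of 21 and standard deviation of 3.2.
intervals = empirical_rule_intervals(mean=21, sd=3.2)

coverage = {1: "About 68%", 2: "About 95%", 3: "Almost all"}
for k, (low, high) in intervals.items():
    print(f"{coverage[k]} of scores: {low:.1f} to {high:.1f}")
```

The same function can be called with any other mean and standard deviation of a normally distributed variable.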
Exercise
Suppose the scores of students in a 50-Point Physics
achievement test are normally distributed with mean of 32
and standard deviation of 4.5.
Show on the normal curve the range of test scores
obtained by 68%, 95% and almost all of students who
took the test.
Effect of Mean and Standard
Deviation on Normal Distributions
Skewness
Skewness refers to the measure of the symmetry/asymmetry of a
frequency distribution. Its formula is
If Sk < 0, then you have a negatively skewed distribution.
If Sk = 0, then you have a symmetric distribution (for example, a normal distribution).
If Sk > 0, then you have a positively skewed distribution.
Normal & Skewed Distributions
Examples of a Normal Distribution
In general:
□SAT Scores
□Heights of People
□IQ Test Scores
□Intelligence Test Scores
□Psychological Test Scores
□Behavioral Test Scores
□Ability Test Scores
Example of a Positively Skewed Distribution
(Skewed to the Right)
Example of a Negatively Skewed Distribution
(Skewed to the Left)
SPSS Interpretation of Skewness
In SPSS, if -1 < skewness < 1, then the distribution
can be considered within the range of normality.
• If skewness is less than or equal to - 1, the
distribution is highly negatively skewed.
• If skewness is greater than or equal to +1, the
distribution is highly positively skewed.
Kurtosis
Kurtosis is a measure that describes the shape of a
distribution's tails in relation to its overall shape.
Leptokurtic Distribution is characterized by long
(heavy) tails.
Mesokurtic Distribution has tails similar to those of the normal
distribution.
Platykurtic Distribution is characterized by short
(light) tails.
Formula and Interpretation of
Kurtosis k
The distribution is said to be mesokurtic if k = 3, leptokurtic if k > 3
and platykurtic if k < 3.
SPSS reports excess kurtosis, that is, the value of k minus 3. Thus, in SPSS, the
distribution is said to be mesokurtic if k = 0, leptokurtic if k > 0 and
platykurtic if k < 0. If -1 < k < 1, then the distribution can be
considered within the range of normality.
Both the skewness and kurtosis of
the distribution should be between
-1 and 1 in order to assume the
distribution to be approximately
normally distributed.
Running Descriptive Statistics
in SPSS
Consider the following data on 30 students' scores in a 50-point mathematics
achievement test.
29 38 29 28 42 40
32 32 37 35 28 46
41 41 40 26 28 29
44 44 46 27 31 27
32 32 29 33 27 25
Determine the mean, standard deviation, skewness and kurtosis.
Determine if the distribution is approximately normally distributed.
Output :
Reporting:
The mean score of the students is 33.93 (SD = 6.63). The scores
are non-normally distributed with skewness of 0.50 (SE = 0.43)
and kurtosis of -1.17 (SE = 0.83).
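The reported values can be reproduced outside SPSS with a short Python sketch. It assumes the commonly cited bias-adjusted sample formulas for skewness and excess kurtosis and their standard errors; these formulas are an assumption of this sketch and are not taken from the slides, although they do reproduce the reported figures for this data set.

```python
import math
import statistics

scores = [
    29, 38, 29, 28, 42, 40,
    32, 32, 37, 35, 28, 46,
    41, 41, 40, 26, 28, 29,
    44, 44, 46, 27, 31, 27,
    32, 32, 29, 33, 27, 25,
]

n = len(scores)
mean = statistics.mean(scores)
sd = statistics.stdev(scores)  # sample standard deviation (divides by n - 1)

# Sums of cubed and fourth-power standardized deviations.
z3 = sum(((x - mean) / sd) ** 3 for x in scores)
z4 = sum(((x - mean) / sd) ** 4 for x in scores)

# Bias-adjusted sample skewness and excess kurtosis (assumed formulas).
skewness = n / ((n - 1) * (n - 2)) * z3
kurtosis = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * z4
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

# Standard errors of skewness and kurtosis (assumed formulas).
se_skewness = math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
se_kurtosis = 2 * se_skewness * math.sqrt((n * n - 1) / ((n - 3) * (n + 5)))

# Rule of thumb: both values should lie between -1 and 1.
within_normal_range = -1 < skewness < 1 and -1 < kurtosis < 1

print(f"Mean = {mean:.2f}, SD = {sd:.2f}")
print(f"Skewness = {skewness:.2f} (SE = {se_skewness:.2f})")
print(f"Kurtosis = {kurtosis:.2f} (SE = {se_kurtosis:.2f})")
print("Within the range of normality:", within_normal_range)
```

Here the skewness lies between -1 and 1 but the kurtosis does not, which is why the scores are reported as non-normally distributed.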
Alternative Tests of Normality:
• Kolmogorov-Smirnov Test
• Shapiro-Wilk Test
Normality Tests:
□Shapiro-Wilk test
if sample size is less than 50
□Kolmogorov-Smirnov test
if sample size is greater than 50
If the significance value (Sig.) of the test is greater
than 0.05, then normality can be assumed.
Another Way of Running Descriptives in
SPSS....