17 A Introduction To Descriptive Statistics and Exploratory Data Analysis


Descriptive Statistics and

Exploratory Data Analysis

Ambo University

Ambo University TEFL 702


Instructor: Mulugeta Teka (PhD)
Introduction
• Social and behavioral scientists need statistics more
than most other scientists.
• As a matter of fact, humans vary so much from each
other along every conceivable dimension.
• This fact creates a particular need to summarize all
this variability in order to make sense of it.
• The purpose of descriptive statistics is to use just a few
numbers to capture the meaning of a much larger
collection of observations on many different cases.
• e.g. people, schools or colleges; or the same cases on
many different occasions; or some combination of the
two.
Types of Statistics
• Descriptive statistics are used to classify and summarize
numerical data, i.e., to describe data.
• e.g., mean, percentage, standard deviation, frequency
• Often, computing descriptive statistics is just our first step in a
process that uses more advanced statistical methods
• Inferential statistics are techniques that allow us to
explore in depth the relationships between variables or to
make estimates about cases that we will never have the
opportunity to measure directly.
• They consist of procedures for making generalizations about a
population by studying a sample from the population.
• e.g., t-tests, ANOVA, regression

Descriptive Statistics & Exploratory Data Analysis
• Descriptive statistics allow you to summarize the properties of
an entire distribution of scores with just a few numbers.
• Although descriptive statistics are commonly used to address the
specific questions that you had in mind when you designed your
study, they also can be employed to help you discover important
but perhaps hidden patterns in your data that may shed
additional light on the problems you are interested in resolving.
• The search for such patterns in your data is termed exploratory
data analysis (EDA) .
• When research has been designed to answer a specific question
or set of questions, there is a strong temptation to rush directly
to the inferential statistical techniques that will assess the
“statistical significance” of the findings and to request only those
descriptive statistics related directly to the analysis, such as
group means, standard deviations, and standard errors.
• Resist this temptation.

Descriptive Statistics & Exploratory Data Analysis
• Many of the most commonly used inferential statistics make
certain crucial assumptions about the populations from which
the scores in your data set were drawn.
• If these assumptions are violated, the results of the statistical
analysis may be misleading.
• Some exploratory techniques help you spot serious defects in
your data that may warrant taking corrective action before
you proceed to the inferential analysis.
• Others help you determine which summary statistics would
be appropriate for a given set of data.
• Still others may reveal unsuspected influences.
• It is good to be familiar with a number of descriptive tools
(both numerical and graphical) for describing data and
revealing secrets that hide within.

Populations and samples
• Taking a sample from a population

Sample data ‘represents’ the whole population


Point estimation
Sample data are used to estimate the parameters of a population.
Statistics are calculated using sample data.
Parameters are the characteristics of population data.

Sample mean  --estimates-->  Population mean
Sample SD    --estimates-->  Population SD
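
The idea of point estimation can be sketched in a few lines of Python. This is only an illustration: the population below is simulated (it is not data from the course), and the names `population`, `sample`, `mu`, `sigma`, `x_bar` and `s` are chosen here for the example.

```python
import random
import statistics

random.seed(702)  # fixed seed so the sketch is repeatable

# Hypothetical population of 10,000 test scores (simulated, not real data).
population = [random.gauss(100, 15) for _ in range(10_000)]

# Parameters: computed from every member of the population.
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)

# Statistics: computed from a random sample and used as point estimates.
sample = random.sample(population, 50)
x_bar = statistics.fmean(sample)
s = statistics.stdev(sample)  # sample SD, divides by n - 1

print(f"population mean {mu:.1f}, sample mean {x_bar:.1f}")
print(f"population SD   {sigma:.1f}, sample SD   {s:.1f}")
```

The sample values will usually be close to, but not exactly equal to, the population values; that gap is what inferential statistics tries to quantify.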

Descriptive Statistics
• Descriptive statistics describe data in a way that allows the researcher to inform
readers about
– how often something occurred in the data,
– what typical values or elements were found in the outcomes, or
– how such values were dispersed throughout the data obtained.
• Typical statistics are
– measure of frequency,
– measure of central tendency (such as the mean, mode, or median), and
– measure of variability (typically the variance or standard deviation).
• All three measures can provide important insights into data and help us
understand them better.
• Descriptive statistics give us information only about our particular sample of
subjects; they do not extend beyond this sample to the larger population.
• Inferential statistics centre on using a sample of subjects to represent a
population. A population is the group of interest. A random selection from this
population is called a sample. Statistics such as the mean and standard deviation
are calculated using the sample data and used to estimate the population
parameters.
Descriptive Statistics
• A measure of central tendency is the most obvious and most
typical descriptive statistic that summarizes all of the
observations with a single number—one that best locates the
middle of all the numbers.
• Mean (arithmetic, geometric and harmonic mean)
• Mode and
• Median
• The mean is written X̄ (“X bar”) or M when it describes a sample, and μ when it
describes a population.
• In general, numbers that summarize the scores in a sample are
called statistics (e.g., X̄ is a statistic), whereas numbers that
summarize an entire population are called parameters (e.g., μ is
a parameter).
Measures of Central Tendency
• In many research situations, it is convenient to summarize your
data by applying descriptive statistics.
• A measure of center (also known as a measure of central tendency)
gives you a single score that represents the general magnitude of
scores in a distribution. This score characterizes your distribution by
providing information about the score at or near the middle of the
distribution. The most common measures of center are the mode,
the median, and the mean (also called the arithmetic average).
• Each measure of center has strengths and weaknesses. Also,
situations exist in which a given measure of center cannot be used.
• In a symmetrical, unimodal distribution the three measures of
central tendency will all be in the same spot.
• However, in a skewed distribution extreme scores have a larger
effect on the mean than on the median, so while both of these
measures are pulled away from the mode, the mean is pulled
further.
Measures of Central Tendency (Mode)
• The mode is simply the most frequent score in a distribution.
• To obtain the mode, count the number of scores falling into
each response category. The response category with the highest
frequency is the mode. The mode of the distribution 1, 2, 4, 6, 4,
3, 4 is 4.
• No mode exists for a distribution in which all the scores are
different.
• Some distributions, called bimodal distributions, have two
modes.
• Although the mode is simple to calculate, it is limited because
the values of scores outside of the most frequent score are not
represented.
• The only information yielded by the mode is the most frequent
score. The values of other data in the distribution are not taken
into account.
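
As a small illustration, the mode can be found with Python's standard `statistics` module. The first list is the distribution from the slide; the lists `bimodal` and `all_different` are made-up examples added here.

```python
from statistics import multimode

scores = [1, 2, 4, 6, 4, 3, 4]        # distribution from the slide
print(multimode(scores))              # [4]

bimodal = [1, 2, 2, 3, 5, 5, 6]       # hypothetical scores with two modes
print(multimode(bimodal))             # [2, 5]

all_different = [1, 2, 3, 4]          # every score occurs once
print(multimode(all_different))       # [1, 2, 3, 4]: all values tie, so there is no useful mode
```

`multimode` returns every value that shares the highest frequency, which makes the bimodal and no-mode cases easy to spot.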

Measures of Central Tendency (Mode)
• Under most conditions, take into account the other scores to
get an accurate characterization of your data.
• To illustrate this point, consider the following two distributions
of scores:
• 2, 2, 6, 3, 7, 2, 2, 5, 3, 1 and
• 2, 2, 21, 43, 78, 22, 33, 72, 12, 8.
• In both these distributions, the mode is 2. Looking only at the
mode, you might conclude that the two distributions are similar.
Obviously, this conclusion is incorrect.
• It is clear that the second distribution is very different from the
first.
• The mode may not represent a distribution very well and would
not be the best measure to use when comparing distributions.

Measure of Central Tendency (Median)
• The median is the second measure of center.
• The median is the middle score in an ordered distribution. To
calculate the median, follow these steps:
• 1. Order the scores in your distribution from lowest to highest
(or highest to lowest, it does not matter).
• 2. Count down through the distribution and find the score in the
middle of the distribution. This score is the median of the
distribution.
• What is the median of the following distribution: 7, 5, 2, 9, 4, 8,
1?
• The correct answer is 5. The ordered distribution is 1, 2, 4, 5, 7,
8, 9, and 5 is the middle score.
• You may be wondering what to do if you have an even number
of scores in your distribution. In this case, there is no middle
score.

Measure of Central Tendency (Median)
• To calculate a median with an even number of scores, you order
the distribution as before and then identify the two middle
scores.
• The median is the average of these two scores. For example,
with the ordered distribution of 1, 3, 6, 7, 8, 9, the median is 6.5
(6+7= 13; 13/2 = 6.5).
• The median takes more information into account than the
mode.
• However, it is still a rather insensitive measure of center
because it does not take into account the magnitudes of the
scores above and below the median.
• As with the mode, two distributions can have the same median
and yet be very different in character.
• For this reason, the median is used primarily when the mean is
not a good choice.
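
The steps for finding the median (odd and even number of scores) can be sketched as follows. The function name `median_by_steps` is invented for this illustration; Python's built-in `statistics.median` gives the same results.

```python
def median_by_steps(scores):
    """Median found by ordering the scores and locating the middle."""
    ordered = sorted(scores)          # step 1: order the scores
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:                    # odd n: a single middle score exists
        return ordered[middle]
    # even n: average the two middle scores
    return (ordered[middle - 1] + ordered[middle]) / 2

print(median_by_steps([7, 5, 2, 9, 4, 8, 1]))   # 5
print(median_by_steps([1, 3, 6, 7, 8, 9]))      # 6.5
```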

Measure of Central Tendency (Mean)
• The Mean (denoted as M ) is the most sensitive measure of
center because it takes into account all scores in a distribution
when it is calculated. It is also the most widely used measure of
center.
• To obtain the mean, simply add together all the scores in the
distribution and then divide by the total number of scores (n).
• The major advantage of the mean is that, unlike the mode and
the median, its value is directly affected by the magnitude of
each score in the distribution.
• However, this sensitivity to individual score values also makes
the mean susceptible to the influence of outliers.
• One or two such outliers may cause the mean to be artificially
high or low.

Measure of Central Tendency (Mean)
• The following two distributions illustrate this point.
• Assume that Distribution A contains the scores 4, 6, 3, 8, 9, 2, 3,
and Distribution B contains the scores 4, 6, 3, 8, 9, 2, 43.
• Although the two distributions differ by only a single score (3
versus 43), they differ greatly in their means (5 versus 10.7,
respectively).
• The mean of 5 appears to be more representative of the first
distribution than the mean of 10.7 is of the second.
• The median is a better measure of center for the second
distribution. The medians of the two distributions are 4 and 6,
respectively—not nearly as different from one another as the
means.
• Before you choose a measure of center, carefully evaluate your
data for skewness and the presence of deviant, outlying scores.
• Do not blindly apply the mean just because it is the most
sensitive measure of center.
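
The effect of a single outlier on the mean and the median can be checked directly. The sketch below uses Distributions A and B from the slide; the variable names are chosen here for illustration.

```python
from statistics import fmean, median

dist_a = [4, 6, 3, 8, 9, 2, 3]
dist_b = [4, 6, 3, 8, 9, 2, 43]   # identical except for the outlier 43

for name, scores in (("A", dist_a), ("B", dist_b)):
    print(f"Distribution {name}: mean = {fmean(scores):.1f}, median = {median(scores)}")
# Distribution A: mean = 5.0, median = 4
# Distribution B: mean = 10.7, median = 6
```

The outlier more than doubles the mean but moves the median by only two points.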
Choosing a Measure of Center
• Which of the three measures of center you choose depends on
two factors: the scale of measurement and the shape of the
distribution of the scores.
• Before you use any measure of center, evaluate these two
factors.
• If your data were measured on a nominal scale, you are limited to
using the mode.
• It makes no sense to calculate a median or mean sex, even if the
sex of subjects has been coded as 0s (males) and 1s (females).
• If your data were measured on an ordinal scale, you could
properly use either the mode or the median, but it would be
misleading to use the mean as your measure of center.
• This is because the mean is sensitive to the distance between
scores. With an ordinal scale, the actual distance between points
is unknown. You cannot assume that scores equally distant in
terms of rank order are equally far apart, but you do assume this
(in effect) if you use the mean.
Choosing a Measure of Center
• The mean can be used if your data are scaled on an interval or
ratio scale.
• On these two scales, the numerical distances between values are
meaningful quantities.
• Even if your dependent measure were scaled on an interval or
ratio scale, the mean may be inappropriate.
• One of the first things you should do when summarizing your
data is to generate a frequency distribution of the scores.
• Next, plot the frequency distribution as a histogram or stemplot
and examine its shape.
• If your scores are normally distributed (or at least nearly
normally distributed), then the mean, median, and mode will fall
at the same point in the middle of the distribution. When your
scores are normally distributed, use the mean as your measure
of center because it is based on the most information.
• As your distribution deviates from normality, the mean becomes
a less representative measure of center.
Measures of Spread/Variability
• Another important descriptive statistic you should apply
to your data is a measure of spread (also known as a
measure of variability ).
• When you conduct an experiment, it is extremely
unlikely that your subjects will all produce the same
score on your dependent measure.
• A measure of spread provides information that helps
you to interpret your data.
• Two sets of scores may have highly similar means yet
very different distributions, as the following example
illustrates.

Measures of Spread/Variability
• Imagine that you are a scout for a professional baseball team
and are considering one of two players for your team. Each
player has a .263 batting average over 4 years of college. The
distributions of the two players’ averages are as follows:
• Player 1: .260, .397, .200, .195
• Player 2: .263, .267, .259, .263
• Which of these two players would you prefer to have on your
team? Most likely, you would pick Player 2 because he is more
“consistent” than Player 1.
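• This simple example illustrates an important point about
descriptive statistics: a measure of center alone does not describe
a distribution; you also need to know how much the scores vary.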
• Range is the simplest way to measure the width of a distribution,
just the highest minus the lowest score. The range tells us the
largest difference that we have among our scores.

Measures of Spread/Variability
• The mean deviation (the average of the absolute deviations from
the mean) is a good descriptive measure, which is less
affected by outliers than the standard deviation, but it is not
used in advanced statistics.
• If you square the deviations instead of taking their absolute
values, and then average these squared deviations, you will get
an important measure of variability, called variance. The
variance is not appropriate for descriptive purposes, but it plays
an important role in advanced statistics.
• Taking the square root of the variance produces a good
descriptive measure of variability that is called the standard
deviation.
• Standard deviation (s) is a measure of how much the individuals
differ from the mean.
• The standard deviation is a good descriptive measure of
variability. It also plays an important role in advanced statistics.
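
As a rough sketch, the three measures described above can be computed for the two players' batting averages from the baseball example. The function name `spread_measures` is invented here, and the variance is taken as the plain average of the squared deviations, as described on the slide (sample formulas divide by n - 1 instead).

```python
from math import sqrt
from statistics import fmean

def spread_measures(scores):
    """Return (mean deviation, variance, standard deviation) of the scores."""
    m = fmean(scores)
    deviations = [x - m for x in scores]
    mean_deviation = fmean([abs(d) for d in deviations])
    variance = fmean([d * d for d in deviations])   # average squared deviation
    return mean_deviation, variance, sqrt(variance)

player_1 = [.260, .397, .200, .195]
player_2 = [.263, .267, .259, .263]

for name, scores in (("Player 1", player_1), ("Player 2", player_2)):
    md, var, sd = spread_measures(scores)
    print(f"{name}: mean deviation = {md:.3f}, variance = {var:.6f}, SD = {sd:.4f}")
```

Both players have the same mean (.263), but every measure of spread is far larger for Player 1.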
Interpretation of standard deviation
The larger the standard deviation, the more spread out the data
are. The smaller the SD, the less variation from the mean.

Measures of Variability

Measures of Spread/Variability
• The Range is the simplest and least informative measure of spread.
• The range is simply the difference between the highest and lowest scores.
• Range gives an indication of the spread of the scores, but of course it depends
completely on just two figures from the whole set, the highest and the lowest.
• One very low or very high score will produce a large increase in the range, and
this might be quite misleading.
• To calculate the range, you simply subtract the lowest score from the highest
score.
• In the baseball example, the range for Player 1 is .202, and that for Player 2
is .008.
• Two problems with the range are that it does not take into account the
magnitude of the scores between the extremes and that it is very sensitive to
outliers in the distribution.
• Compare the following two distributions of scores: 1, 2, 3, 4, 5, 6 and 1, 2, 3, 4, 5,
31.
• The range for the first distribution is 5, and the range for the second is 30.
• The two ranges are highly discrepant despite the fact that the two distributions
are nearly identical.
• For these reasons, the range is rarely used as a measure of spread.
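
The ranges quoted above can be reproduced with a one-line function. The name `value_range` is chosen here for illustration, and the results are rounded to avoid floating-point noise.

```python
def value_range(scores):
    """Highest score minus lowest score."""
    return max(scores) - min(scores)

player_1 = [.260, .397, .200, .195]
player_2 = [.263, .267, .259, .263]
print(round(value_range(player_1), 3))          # 0.202
print(round(value_range(player_2), 3))          # 0.008

print(value_range([1, 2, 3, 4, 5, 6]))          # 5
print(value_range([1, 2, 3, 4, 5, 31]))         # 30
```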
Measures of Spread/Variability
• One alternative measure is the interquartile range.
• The interquartile range is another measure of spread that is easy
to calculate.
• To obtain the interquartile range, follow these steps:
• 1. Order the scores in your distribution.
• 2. Divide the distribution into four equal parts (quarters).
• 3. Find the score separating the lower 25% of the distribution
(Quartile 1, or Q 1 ) and the score separating the top 25% from
the rest of the distribution (Q 3 ).
• The interquartile range is equal to Q 3 minus Q 1 (the difference
between the 25th and 75th percentiles).
• The interquartile range is less sensitive than the range to the
effects of extreme scores.
• It also takes into account more information because more than
just the highest and lowest scores are used for its calculation.
The interquartile range may be preferred.
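
A minimal sketch of the interquartile range, using Python's `statistics.quantiles`. Note the assumption: several conventions exist for locating quartiles in small data sets, and the `"inclusive"` method used here is only one of them, so other software may report slightly different quartiles. The function name `interquartile_range` is invented for this example.

```python
from statistics import quantiles

def interquartile_range(scores):
    """Q3 minus Q1, with quartiles found by linear interpolation."""
    q1, _q2, q3 = quantiles(scores, n=4, method="inclusive")
    return q3 - q1

first = [1, 2, 3, 4, 5, 6]
second = [1, 2, 3, 4, 5, 31]    # same scores except for one outlier

print(interquartile_range(first))    # 2.5
print(interquartile_range(second))   # 2.5
```

The range of these two distributions differs greatly (5 versus 30), while the interquartile range is unchanged by the outlier.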
Measures of Spread/Variability
• As mentioned earlier, the median is that score which divides the
set into two halves, with half the scores falling below the median
and half the scores falling above it.
• The median is the 50th percentile, which means 50% of the scores
fall below it. We can also have a 25th percentile, which is the score
below which 25% of the scores fall, a 75th percentile, a 90th
percentile etc.
• The semi-interquartile range is the interquartile range divided by
2.
• Unlike the range, the interquartile range is not affected by a single
score which is much greater or much less than the others.
• But it uses only two figures from the set to express the
variability in the set, and so ignores most of the numbers.
• Use the interquartile range in situations in which you want a
relatively simple, rough measure of spread that is resistant to
the effects of skew and outliers.

Measures of Spread
• The variance is the average squared deviation from the mean.
• Although the variance is frequently used as a measure of
spread in certain statistical calculations, it does have the
disadvantage of being expressed in units different from those of
the summarized data.
• However, the variance can be easily converted into a measure
of spread expressed in the same unit of measurement as the
original scores: the standard deviation (s).
• To convert from the variance to the standard deviation, simply
take the square root of the variance.
• The standard deviation is the most popular measure of spread.

Choosing a Measure of Spread
• The choice of a measure of center is affected by the distribution of
the scores, and the same is true for the choice of a measure of
spread.
• Like the mean, the range and standard deviation are sensitive to
outliers. In cases in which your distribution has one or more
outliers, the interquartile range may provide a better measure of
spread.
• In addition to noting the presence of outliers, you should note the
shape of the distribution (normal or skewed) when selecting a
measure of spread.
• Remember that the mean is not a representative measure of
center when your distribution of scores is skewed and that the
mean is used to calculate the standard deviation.
• Consequently, with a skewed distribution, the standard deviation
does not provide a representative measure of spread.
• If your distribution is seriously skewed, use the interquartile range
instead.
Choosing a Measure of Spread
• Boxplots and the Five-Number Summary
• The five-number summary provides a useful way to boil
down a distribution into just a few easily grasped numbers,
several of which are resistant to the effects of skew and
outliers and all of which are based on the ranks of the scores.
• Included in the five-number summary are the following: the
minimum, the first quartile, the median (second quartile), the
third quartile, and the maximum.
• The minimum and maximum are simply the smallest and
largest scores in the distribution; these are not resistant
measures for the simple fact that the most extreme outliers
will fall at the ends of the distribution and therefore are likely
to be the maximum or minimum scores.
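
The five-number summary can be sketched as below. The function name `five_number_summary` is invented here, the data are Distribution B from the earlier mean example, and the quartiles again depend on the convention chosen (the `"inclusive"` method of `statistics.quantiles` is an assumption, not the only option).

```python
from statistics import quantiles

def five_number_summary(scores):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, q2, q3 = quantiles(scores, n=4, method="inclusive")
    return min(scores), q1, q2, q3, max(scores)

scores = [4, 6, 3, 8, 9, 2, 43]
print(five_number_summary(scores))   # (2, 3.5, 6.0, 8.5, 43)
```

The outlier 43 shows up only as the maximum; the quartiles and the median are unaffected by how extreme it is.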

Choosing summary statistics
Which measure of centre and spread?

Type of data                   Measure of centre   Measure of spread
Scale, normally distributed    Mean                Standard deviation
Scale, skewed                  Median              Interquartile range
Categorical, ordinal           Median              Interquartile range
Categorical, nominal           Mode                None
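
The decision rule on this slide can also be written as a small lookup function. This is only a sketch of the rule as stated here; the function name `choose_summary` and the labels `"nominal"`, `"ordinal"` and `"scale"` are assumptions made for the example.

```python
def choose_summary(scale, skewed=False):
    """Return (measure of centre, measure of spread) for a variable.

    `scale` is "nominal", "ordinal", or "scale" (interval or ratio data).
    `skewed` only matters for scale data.
    """
    if scale == "nominal":
        return ("mode", None)
    if scale == "ordinal":
        return ("median", "interquartile range")
    if scale == "scale":
        if skewed:
            return ("median", "interquartile range")
        return ("mean", "standard deviation")
    raise ValueError(f"unknown scale of measurement: {scale!r}")

print(choose_summary("scale"))                 # ('mean', 'standard deviation')
print(choose_summary("scale", skewed=True))    # ('median', 'interquartile range')
print(choose_summary("nominal"))               # ('mode', None)
```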

Normal Curve & Distribution
• The concept of “grading on the curve” comes from a normal
distribution, where there are an equal but small number of As
and Fs, more Bs and Ds, and then lots of Cs in the middle.
• Graphing the frequency of each grade results in a bell-shaped
curve, with fewer grades at the extremes and most in the middle.
• As another example, you probably know very few, if any, adults
who are over 7 feet tall.
• Likewise, you probably know very few adults who are under 4
feet tall.
• Most people are between 5 feet and 6 feet tall, in the middle of
the bell-shaped curve.
• A distribution with fewer people (or scores) at the extremes and
more people in the middle is considered “normal.”
Normal Distribution
• Many variables are distributed normally, including height,
weight, IQ scores, SAT, and other achievement scores.
• Knowing that a variable is normally distributed turns out to be
quite valuable in research.
• If a variable is normally distributed, that is, forms a normal, or
bell-shaped, curve, then several things are true:
• 1. Fifty percent of the scores are above the mean, and 50% are
below the mean.
• 2. The mean, the median, and the mode have the same value.
• 3. Most scores are near the mean. The farther from the mean a
score is, the fewer the number of participants who attained
that score.

Normal Distribution
• 4. For every normal distribution, 34.13% of the scores fall
between the mean and one standard deviation above the
mean, and 34.13% of the scores fall between the mean and one
standard deviation below the mean (see Figure 12.1; find the
midpoint on the curve and look at the portions under the curve
marked 34.13%).
• In other words, 68.26% of the scores are within one standard
deviation of the mean (34.13% + 34.13%).
• More than 99% of the scores will fall somewhere between
three standard deviations above and three standard
deviations below the mean.
• Knowing where scores are on the normal curve helps us
understand where they are placed relative to the full dataset.
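
These percentages can be checked against the standard normal curve with Python's `statistics.NormalDist`. The variable names below are chosen for this sketch.

```python
from statistics import NormalDist

z = NormalDist()   # standard normal curve: mean 0, standard deviation 1

mean_to_plus_1sd = z.cdf(1) - z.cdf(0)
within_1sd = z.cdf(1) - z.cdf(-1)
within_3sd = z.cdf(3) - z.cdf(-3)

print(f"{mean_to_plus_1sd:.2%}")   # 34.13%
print(f"{within_1sd:.2%}")         # 68.27%
print(f"{within_3sd:.2%}")         # 99.73%
```

The 68.26% figure on the slide comes from adding the two rounded halves (34.13% + 34.13%); computed without intermediate rounding, the area is about 68.27%.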

Displaying Distributions
• Frequency distributions take the form of tables or graphs.
• See the following table in the next slide which presents a hypothetical
frequency distribution of IQ scores using the classes just given.
• Because IQ scores are quantitative data, the classes are presented in
order of value from highest to lowest.
• To the right of each class is its frequency (f), the number of data
values falling into that class.
• Because there were no IQ scores below 65 or above 134, classes
beyond these limits are not tabled.
• Although a table provides a compact summary of the distribution, it is
not particularly easy to extract useful information from it about
center, spread, and shape.
• Graphical or semi-graphical displays are much better for this purpose.

Frequency Distribution Table of Hypothetical IQ Data

CLASS      f

125–134    5
115–124   12
105–114   22
95–104    25
85–94     26
75–84      7
65–74      3
Σf       100
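
A table like this can be built from raw scores by counting how many fall into each class. The sketch below assumes whole-number scores and classes of width 10 starting at 65, as in the table; the function name `frequency_table` and the ten scores are hypothetical.

```python
from collections import Counter

def frequency_table(scores, lowest=65, width=10):
    """Return (class label, frequency) pairs, highest class first."""
    counts = Counter((score - lowest) // width for score in scores)
    table = []
    for k in sorted(counts, reverse=True):
        lower = lowest + k * width
        table.append((f"{lower}-{lower + width - 1}", counts[k]))
    return table

scores = [67, 72, 88, 91, 93, 99, 101, 104, 110, 128]   # made-up IQ scores
for label, f in frequency_table(scores):
    print(f"{label:>8} {f:>3}")
print(f"{'Sum of f':>8} {len(scores):>3}")
```

Classes with no scores (here 75-84 and 115-124) are simply left out of this sketch; a full table would normally list them with a frequency of zero.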

Frequency Distributions
• A frequency distribution shows the number of times each score occurs in the
set of scores under examination.
• Histograms and bar charts are graphical displays of the frequency distribution
of the scores.
• Histograms resemble bar graphs, with each bar representing a class.
• Unlike the bars in a bar graph, those in a histogram are drawn touching to
indicate that there are no gaps between adjacent classes.
• A histogram is used for a continuous variable, and a bar chart for a categorical
variable. In a bar chart the bars are separated to indicate that they represent
different categories.
• In SPSS, any category which is empty (has a frequency of zero) will not be
shown in a bar chart.
• A frequency distribution can be symmetrical or skewed. The distribution is
skewed if the scores tend to be piled up at one end of the scale.
• If a distribution is roughly symmetrical, the mean can be used as the measure
of central tendency, but if it is skewed the median should be used rather than
the mean.
• A normal distribution, which is symmetrical, has a skewness statistic of zero.
• Kurtosis measures the extent to which observations are clustered in the tails.

The Stemplot
• As a quick alternative to the histogram, you might consider using a
stemplot (also known as a stem-and-leaf plot), which was invented
by statistician John Tukey (1977), to simplify the job of displaying
distributions.
• To create a stemplot of your data, you simply break each number
into two parts: stem and leaf.
• The stem part might consist, for example, of the leftmost column or
columns and the leaf part, the rightmost column.
• Thus, an IQ score of 67 would be broken into its leftmost number, or
stem (6), and rightmost number, or leaf (7).
• After finding the lowest and highest stems, make a column that
includes all the numbers in ascending order from lowest to highest
stem.
• Then draw a vertical line immediately to the right of the stem
column.

The Stemplot
• Finally, for each score in your data, find its stem number and then
write its leaf number on the same row immediately to the right of the
stem.
• So, the IQ score of 67 would look like the first entry at the top of the
stemplot to be drawn (Figure 13-16).
• You do this for each number in your distribution.
• The final result is a stemplot of the hypothetical IQ data.
• Stemplots are easy to construct and display and have the
advantage over histograms and tables of preserving all the actual
values present in the data.
• However, you do not have much freedom to choose the class widths
because stemplots inherently create class widths of 10 (the span of
a stem).
• Stemplots are not especially useful for larger data sets because the
number of leaves becomes too large.
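
The construction described on these two slides can be sketched in a few lines. The sketch assumes non-negative whole-number scores, with the last digit as the leaf; the function name `stemplot` and the ten scores are made up for illustration.

```python
from collections import defaultdict

def stemplot(scores):
    """Return the lines of a stem-and-leaf plot (stem = tens, leaf = units)."""
    leaves = defaultdict(list)
    for score in sorted(scores):
        stem, leaf = divmod(score, 10)       # e.g. 67 -> stem 6, leaf 7
        leaves[stem].append(leaf)
    lines = []
    # List every stem from lowest to highest, including empty ones.
    for stem in range(min(leaves), max(leaves) + 1):
        row = "".join(str(leaf) for leaf in leaves[stem])
        lines.append(f"{stem:>2} | {row}")
    return lines

scores = [67, 72, 88, 91, 93, 99, 101, 104, 110, 128]   # hypothetical IQ scores
print("\n".join(stemplot(scores)))
```

Each row keeps the actual values (the row `9 | 139` stands for 91, 93 and 99), which a histogram or frequency table would lose.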
Examining Your Distribution
• When examining a histogram or stemplot of your data, look for
the following important features.
• First, locate the center of the distribution along the scale of
measurement.
• In the distribution plotted, are the scores centered around the
mean or somewhere else?
• The location of the center of a distribution tells you where the
scores tended to cluster along the scale of measurement.
• Second, note the spread of the scores.
• Do they tend to bunch up around the center or spread far from it?
• The spread of the scores indicates how variable they are.

Examining Your Distribution
• Third, note the overall shape of the distribution.
• Is it hill shaped, with a single peak at the center, or does it have
more than one peak?
• If hill shaped, is it more or less symmetrical, or is it skewed?

• There are two main ways in which a distribution can deviate
from normal:
• (1) lack of symmetry (called skew) and
• (2) pointiness (called kurtosis).

SKEWED DISTRIBUTIONS
• A skewed distribution has a long “tail” trailing off in one direction
and a short tail extending in the other.
• Skewed distributions are not symmetrical and instead the most
frequent scores (the tall bars on the graph) are clustered at one
end of the scale.
• A skewed distribution can be either positively or negatively
skewed.
• A distribution is positively skewed if the long tail goes off to the
right, upscale (the frequent scores are clustered at the lower end
and the tail points towards the higher or more positive scores).
• A distribution is negatively skewed if the long tail goes off to the
left, downscale (the frequent scores are clustered at the higher end
and the tail points towards the lower or more negative scores).
• The figure shows examples of these distributions.
Positively (left) and Negatively (right) Skewed
Distributions

KURTOSIS
• Distributions also vary in their kurtosis.
• Kurtosis refers to the degree to which scores cluster at the ends of
the distribution (known as the tails) and this tends to express
itself in how pointy a distribution is (but there are other factors
that can affect how pointy the distribution looks).
• A distribution with positive kurtosis has many scores in the tails (a
so-called heavy-tailed distribution) and is pointy. This is known as
a leptokurtic distribution.
• In contrast, a distribution with negative kurtosis is relatively thin in
the tails (has light tails) and tends to be flatter than normal. This
distribution is called platykurtic.
• Ideally, we want our data to be normally distributed (i.e., not too
skewed, and not too many or too few scores at the extremes!).
Distributions with positive kurtosis (leptokurtic,
left) and negative kurtosis (platykurtic, right)

Examining Your Distribution
• Because many common inferential statistics assume that the
data follow a normal distribution, check the distribution of your
data to see whether this assumption seems reasonable.
• The first way to check whether the distribution is normal or
skewed is to examine the shape of the distribution visually
(histograms and the corresponding P-P plots).
• We can see in the P-P plot whether the data points all fall very
close to or deviate away from the ‘ideal’ diagonal line.
• The second way is to interpret the values of skew and kurtosis.
• In a normal distribution the values of skew and kurtosis are 0
(i.e., the tails of the distribution are as they should be).
• If a distribution has values of skew or kurtosis above or below
0, then this indicates a deviation from normal.
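
To make the two statistics concrete, here is a sketch that computes skewness and kurtosis from the moments of the scores. It uses the simple moment-based definitions (kurtosis reported as excess kurtosis, so a normal distribution scores 0); statistical packages typically apply small-sample adjustments, so their output may differ somewhat from these values. The function name `skew_and_kurtosis` is invented here.

```python
from statistics import fmean

def skew_and_kurtosis(scores):
    """Return (skewness, excess kurtosis) using simple moment formulas."""
    n = len(scores)
    m = fmean(scores)
    m2 = sum((x - m) ** 2 for x in scores) / n   # second central moment
    m3 = sum((x - m) ** 3 for x in scores) / n   # third central moment
    m4 = sum((x - m) ** 4 for x in scores) / n   # fourth central moment
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3
    return skew, excess_kurtosis

symmetric = [1, 2, 3, 4, 5]
with_outlier = [4, 6, 3, 8, 9, 2, 43]   # long tail to the right

print(skew_and_kurtosis(symmetric))      # skewness 0 for a symmetrical set
print(skew_and_kurtosis(with_outlier))   # positive skewness
```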
Examining Your Distribution
• The values of S (skewness) and K (kurtosis) and their respective
standard errors are produced by SPSS.
• By dividing values of S (skewness) and K (kurtosis) by their respective
standard errors, we obtain the z-scores which can be compared
against values that you would expect to get if skew and kurtosis
were not different from 0.
• So, an absolute value greater than 1.96 is significant at p < .05, above
2.58 is significant at p < .01 and above 3.29 is significant at p < .001.
• However, you really should use these criteria only in small samples:
in larger samples examine the shape of the distribution visually,
interpret the value of the skewness and kurtosis statistics, and
possibly don’t even worry about normality at all.
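
The cut-off values quoted above (1.96, 2.58, 3.29) are the two-tailed critical values of the standard normal curve, which can be verified as follows. The skewness value and standard error at the end are made-up numbers used only to show the calculation, and the function name `z_score` is chosen for this sketch.

```python
from statistics import NormalDist

def z_score(statistic, standard_error):
    """Convert a skewness or kurtosis value to a z-score."""
    return statistic / standard_error

# Two-tailed critical values of the standard normal distribution.
critical = {alpha: NormalDist().inv_cdf(1 - alpha / 2)
            for alpha in (0.05, 0.01, 0.001)}
for alpha, z_crit in critical.items():
    print(f"p < {alpha}: |z| > {z_crit:.2f}")

# Hypothetical output: skewness = 0.85 with a standard error of 0.35.
z = z_score(0.85, 0.35)
print(f"z = {z:.2f}")   # about 2.43: beyond 1.96 but not beyond 2.58
```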

Examining Your Distribution
• Another way of looking at the problem is to see whether the
distribution of scores deviates from a comparable normal
distribution.
• The Kolmogorov–Smirnov test and Shapiro–Wilk test compare
the scores in the sample to a normally distributed set of scores
with the same mean and standard deviation.
• If the test is non-significant (p > .05) it tells us that the
distribution of the sample is not significantly different from a
normal distribution (i.e., it is probably normal).
• If, however, the test is significant (p < .05) then the distribution in
question is significantly different from a normal distribution (i.e.,
it is non-normal).
• These tests seem great: in one easy procedure
they tell us whether our scores are normally distributed (nice!).