Unit 2
DESCRIBING DATA
Types of Data - Types of Variables - Describing Data with Tables and Graphs - Describing
Data with Averages - Describing Variability
WHAT IS STATISTICS?
Statistics exists because of the prevalence of variability in the real world.
Descriptive Statistics:
• In its simplest form, known as descriptive statistics, statistics provides us with tools
(tables, graphs, averages, ranges, correlations) for organizing and summarizing the
inevitable variability in collections of actual observations or scores.
• Eg: A tabular listing ranked from most to least; a graph showing the annual change in
global temperature during the last 30 years.
Inferential Statistics:
• Statistics also provides tools—a variety of tests and estimates—for generalizing
beyond collections of actual observations.
• This more advanced area is known as inferential statistics.
• Eg: An assertion about the relationship between job satisfaction and overall
happiness
Data
Quantitative Data:
• The weights reported by 53 male students in Table 1.1 are quantitative data, since any
single observation, such as 160 lbs, represents an amount of weight.
Ranked Data
• Ranked data are numbers that represent relative standing within a group, for example
ranks from 1 to 15 assigned to the 15 items in a list.
Qualitative Data
• The Y and N replies of students in Table 1.2 are qualitative data, since any single
observation is a letter that represents a class of replies.
Approximate Numbers
• In theory, values for continuous variables can be carried out infinitely far.
• Eg: Someone’s weight, in pounds, might be 140.01438, and so on, to infinity!
• Practical considerations require that values for continuous variables be rounded off.
• Whenever values are rounded off, as is always the case with actual values for
continuous variables, the resulting numbers are approximate, never exact.
• For example, the weights of the male statistics students in Table 1.1 are reported to
the nearest pound.
• A student whose weight is listed as 150 lbs could actually weigh between 149.5 and
150.5 lbs.
Independent and Dependent Variables
• Most studies raise questions about the presence or absence of a relationship
between two (or more) variables.
• Eg: A psychologist might wish to investigate whether couples who
undergo special training in “active listening” tend to have fewer communication
breakdowns than do couples who undergo no special training.
• An experiment is a study in which the investigator decides who receives the special
treatment.
Dependent Variable
• When a variable is believed to have been influenced by the independent variable, it is
called a dependent variable.
• In an experimental setting, the dependent variable is measured, counted, or
recorded by the investigator.
• Unlike the independent variable, the dependent variable isn’t manipulated by the
investigator.
• Instead, it represents an outcome: the data produced by the experiment.
• Eg: To test whether training influences communication, the psychologist counts the
number of communication breakdowns between each couple.
Observational Studies
• Instead of undertaking an experiment, an investigator might simply observe the
relation between two variables. For example, a sociologist might collect paired
measures of poverty level and crime rate for each individual in some group.
• Such studies are often referred to as observational studies.
• An observational study focuses on detecting relationships between variables not
manipulated by the investigator, and it yields less clear-cut conclusions about cause-
effect relationships than does an experiment.
Confounding Variable
• Whenever groups differ not just because of the independent variable but also because
some uncontrolled variable co-varies with the independent variable, any conclusion
about a cause-effect relationship is suspect.
• A difference between groups might be due not to the independent variable but to a
confounding variable.
• For instance, couples willing to devote extra effort to special training might already
possess a deeper commitment that co-varies with more active-listening skills.
• An uncontrolled variable that compromises the interpretation of a study is known as a
confounding variable.
Problems:
III. DESCRIBING DATA WITH TABLES AND GRAPHS:
• To organize the weights of the male statistics students listed in Table 1.1, first arrange
a column of consecutive numbers, beginning with the lightest weight
(133) at the bottom and ending with the heaviest weight (245) at the top.
• Place a short vertical stroke or tally next to a number each time its value appears in the
original set of data; once this process has been completed, substitute for each tally
count a number indicating the frequency (f) of occurrence of each weight.
• When observations are sorted into classes of single values, as in Table 2.1, the result
is referred to as a frequency distribution for ungrouped data.
• The frequency distribution shown in Table 2.1 is only partially displayed because there
are more than 100 possible values between the largest and smallest observations.
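The tallying procedure above can be sketched in a few lines of Python. This is only an illustration: the weights below are made-up values that share the smallest (133) and largest (245) observations mentioned in the notes, not the actual contents of Table 1.1, and the helper name `ungrouped_distribution` is hypothetical.

```python
from collections import Counter

# Hypothetical weights in pounds; NOT the actual Table 1.1 data.
weights = [160, 193, 226, 152, 160, 157, 160, 152, 245, 133]

# Tally step: count how many times each single value occurs.
frequency = Counter(weights)

def ungrouped_distribution(counts):
    """Return (value, f) rows from the heaviest value down to the lightest,
    listing every consecutive value, including those with zero frequency."""
    lo, hi = min(counts), max(counts)
    return [(value, counts.get(value, 0)) for value in range(hi, lo - 1, -1)]

rows = ungrouped_distribution(frequency)
for value, f in rows:
    if f:  # print only observed values to keep the listing short
        print(value, f)
print("Total", sum(f for _, f in rows))
```

Even with only ten observations the full column runs to 113 rows, which is why a single-value table for these data is usually shown only in part.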
Grouped Data
• When observations are sorted into classes of more than one value, as in Table 2.2, the
result is referred to as a frequency distribution for grouped data.
• Data are grouped into class intervals with 10 possible values each.
• The bottom class includes the smallest observation (133), and the top class includes
the largest observation (245).
• The distance between bottom and top is occupied by an orderly series of classes.
• The frequency (f) column shows the frequency of observations in each class and, at
the bottom, the total number of observations in all classes.
Essential:
1. Each observation should be included in one, and only one, class.
Example: 130–139, 140–149, 150–159, etc.
2. List all classes, even those with zero frequencies.
Example: Listed in Table 2.2 is the class 210–219 and its frequency of zero.
3. All classes should have equal intervals.
Example: 130–139, 140–149, 150–159, etc. It would be incorrect to use 130–139,
140–159, etc.
Optional:
4. All classes should have both an upper boundary and a lower boundary.
Example: 240–249. Less preferred would be 240–above, in which no maximum value can
be assigned to observations in this class.
5. Select the class interval from convenient numbers, such as 1, 2, 3, . . . 10, particularly 5
and 10 or multiples of 5 and 10.
Example: 130–139, 140–149, in which the class interval of 10 is a convenient number.
6. The lower boundary of each class interval should be a multiple of the class interval.
Example: 130–139, 140–149, in which the lower boundaries of 130, 140, are multiples of 10,
the class interval.
7. Aim for a total of approximately 10 classes.
Example: The distribution in Table 2.2 uses 12 classes.
CONSTRUCTING FREQUENCY DISTRIBUTIONS
1. Find the range
2. Find the class interval required to span the range by dividing the range by the desired
number of classes
3. Round off to the nearest convenient interval
4. Determine where the lowest class should begin.
5. Determine where the lowest class should end.
6. Working upward, list as many equivalent classes as are required to include the largest
observation.
7. Indicate with a tally the class in which each observation falls.
8. Replace the tally count for each class with a number, the frequency (f), and show the
total of all frequencies.
9. Supply headings for both columns and a title for the table.
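Steps 1 to 8 can be sketched as a small Python function. Several things here are assumptions made for illustration: the weights are made-up values (not Table 1.1), scores are taken to be whole numbers, the list of "convenient" intervals is one possible choice, and "round off to the nearest convenient interval" is read as picking the convenient number closest to the raw interval.

```python
# Hypothetical weights in pounds; NOT the actual Table 1.1 data.
weights = [133, 145, 152, 157, 160, 160, 168, 172, 185, 193, 205, 226, 245]

def grouped_distribution(scores, desired_classes=10,
                         convenient=(1, 2, 3, 4, 5, 10, 20, 25, 50, 100)):
    """Return (lower, upper, f) rows, lowest class first (whole-number scores)."""
    score_range = max(scores) - min(scores)           # step 1
    raw_interval = score_range / desired_classes      # step 2
    # step 3: choose the convenient interval closest to the raw interval
    interval = min(convenient, key=lambda c: abs(c - raw_interval))
    # step 4: start the lowest class at a multiple of the interval
    lower = (min(scores) // interval) * interval
    rows = []
    while lower <= max(scores):                       # step 6
        upper = lower + interval - 1                  # step 5
        f = sum(lower <= x <= upper for x in scores)  # steps 7 and 8
        rows.append((lower, upper, f))
        lower += interval
    return rows

rows = grouped_distribution(weights)
for lower, upper, f in reversed(rows):  # largest class at the top
    print(f"{lower}-{upper}  {f}")
print("Total", sum(f for _, _, f in rows))
```

With these values the range is 112, the raw interval 11.2 is rounded to 10, and the result is 12 classes from 130-139 to 240-249, including classes with zero frequency.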
Problems:
OUTLIERS
• Very extreme scores are called outliers.
• Ex: A GPA of 0.06, an IQ of 170, summer wages of $62,000
Problem:
RELATIVE FREQUENCY DISTRIBUTIONS
Percentages or Proportions?
• A proportion always varies between 0 and 1, whereas a percentage always varies
between 0 percent and 100 percent.
• To convert the relative frequencies in Table 2.5 from proportions to percentages,
multiply each proportion by 100; that is, move the decimal point two places to the right.
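As a rough illustration of the conversion, the sketch below divides each class frequency by the total to get a proportion and multiplies by 100 to get a percentage. The class labels and frequencies are hypothetical and are not taken from Table 2.5.

```python
# Hypothetical grouped frequencies (class label -> f); not Table 2.5 itself.
frequencies = {"130-139": 3, "140-149": 1, "150-159": 17,
               "160-169": 12, "170-179": 7}

total = sum(frequencies.values())

# Relative frequency as a proportion (between 0 and 1) ...
proportions = {label: f / total for label, f in frequencies.items()}
# ... and as a percentage (between 0 and 100).
percentages = {label: 100 * p for label, p in proportions.items()}

for label, f in frequencies.items():
    print(f"{label}  f={f:2d}  proportion={proportions[label]:.3f}  "
          f"percent={percentages[label]:.1f}")
```

The proportions should sum to 1 and the percentages to 100, apart from rounding.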
Problem:
Cumulative Percentages
• If relative standing within a distribution is particularly important, then cumulative
frequencies are converted to cumulative percentages.
Percentile Ranks
• When used to describe the relative position of any score within its parent
distribution, cumulative percentages are referred to as percentile ranks.
• The percentile rank of a score indicates the percentage of scores in the entire
distribution with similar or smaller values than that score.
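A minimal sketch of cumulative frequencies, cumulative percentages, and percentile ranks follows. The frequency distribution is hypothetical, and `percentile_rank` is an illustrative helper that uses the "equal or smaller" definition given above.

```python
from itertools import accumulate

# Hypothetical ungrouped frequency distribution: (score, f), lowest score first.
distribution = [(2, 1), (4, 2), (5, 1), (7, 2), (8, 1), (9, 2), (10, 1)]

total = sum(f for _, f in distribution)                      # 10 scores
cumulative_f = list(accumulate(f for _, f in distribution))  # running totals
cumulative_pct = [100 * cf / total for cf in cumulative_f]

def percentile_rank(score):
    """Percentage of scores in the distribution at or below `score`."""
    at_or_below = sum(f for value, f in distribution if value <= score)
    return 100 * at_or_below / total

print(cumulative_f)        # [1, 3, 4, 6, 7, 9, 10]
print(percentile_rank(7))  # 60.0
```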
Problem:
FREQUENCY DISTRIBUTIONS FOR QUALITATIVE (NOMINAL) DATA
• When, among a set of observations, any single observation is a word, letter, or
numerical code, the data are qualitative.
• Determine the frequency with which observations occupy each class, and report
these frequencies.
• This frequency distribution reveals that Yes replies are approximately twice as
prevalent as No replies.
• When inspecting a distribution for the first time, train yourself to look at the entire
table, not just the distribution.
• Read the title, column headings, and any footnotes.
• Where do the data come from? Is a source cited? Next, focus on the form of the
frequency distribution.
• When interpreting distributions, including distributions constructed by someone else,
follow the same steps.
GRAPHS
• Data can be described clearly and concisely with the aid of a well-constructed
frequency distribution.
GRAPHS FOR QUANTITATIVE DATA
Histograms
Frequency Polygon
• An important variation on a histogram is the frequency polygon, or line graph.
• Frequency polygons may be constructed directly from frequency distributions.
• However, we will follow the step-by-step transformation of a histogram into a
frequency polygon.
Stem and Leaf Displays
• Stem and leaf displays are ideal for summarizing distributions, such as that for
weight data, without destroying the identities of individual observations.
Constructing a Display
• The leftmost panel of Table 2.9 re-creates the weights of the 53 male statistics
students listed in Table 1.1.
• To construct the stem and leaf display for these data, first note that, when counting
by tens, the weights range from the 130s to the 240s.
• Arrange a column of numbers, the stems, beginning with 13 (representing the
130s) and ending with 24 (representing the 240s).
• Draw a vertical line to separate the stems, which represent multiples of 10, from the
space to be occupied by the leaves, which represent multiples of 1.
• Next, enter each raw score into the stem and leaf display.
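The construction can be sketched as follows, assuming whole-number scores and a stem unit of 10. The weights are made-up values rather than the Table 1.1 data, and `stem_and_leaf` is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical weights in pounds; NOT the actual Table 1.1 data.
weights = [133, 145, 152, 157, 160, 160, 168, 172, 185, 193, 205, 226, 245]

def stem_and_leaf(scores, stem_unit=10):
    """Map each stem to its leaves, in the order the raw scores are entered."""
    display = defaultdict(list)
    for x in scores:
        stem, leaf = divmod(x, stem_unit)  # 157 -> stem 15, leaf 7
        display[stem].append(leaf)
    # Keep stems with no leaves so the column of stems is unbroken.
    return {stem: display.get(stem, [])
            for stem in range(min(display), max(display) + 1)}

for stem, leaves in stem_and_leaf(weights).items():
    print(f"{stem:>3} | {' '.join(str(leaf) for leaf in leaves)}")
```

Each printed row is one stem followed by its leaves, so every original score can still be read back from the display.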
Interpretation
• The weight data have been sorted by the stems. All weights in the 130s are listed
together; all of those in the 140s are listed together, and so on.
• A glance at the stem and leaf display in Table 2.9 shows essentially the same pattern
of weights depicted by the frequency distribution in Table 2.2 and the histogram.
Selection of Stems
• Stem values are not limited to units of 10.
• Depending on the data, you might identify the stem with one or more leading digits that
culminates in some variation on a stem value of 10, such as 1, 100, 1000, or even .1,
.01, .001, and so on.
• Stem and leaf displays are statistical bargains: with little effort they summarize a
distribution while preserving the individual observations.
Problem:
TYPICAL SHAPES
• Whether expressed as a histogram, a frequency polygon, or a stem and leaf display,
an important characteristic of a frequency distribution is its shape.
Normal
• Any distribution that approximates the normal shape in panel A of Figure 2.3 can be
analyzed with the aid of the normal curve.
• The familiar bell-shaped silhouette of the normal curve can be superimposed on many
frequency distributions. Eg: uninterrupted gestation periods of human fetuses, scores
on standardized tests, and even the popping times of individual kernels in a batch of
popcorn.
Bimodal
• Any distribution that approximates the bimodal shape in panel B of Figure 2.3 may reflect
the coexistence of two different types of observations in the same distribution.
• Eg: The distribution of the ages of residents in a neighborhood consisting largely of
either new parents or their infants has a bimodal shape.
Positively Skewed
• The two remaining shapes in Figure 2.3 are lopsided.
• A lopsided distribution caused by a few extreme observations in the positive
direction, as in panel C of Figure 2.3, is a positively skewed distribution.
• Eg: Most family incomes under $200,000 and relatively few family incomes spanning
a wide range of values above $200,000.
Negatively Skewed
• A lopsided distribution caused by a few extreme observations in the negative
direction, as in panel D of Figure 2.3, is a negatively skewed distribution.
• Eg: Most retirement ages at 60 years or older and relatively few retirement ages
spanning the wide range of ages younger than 60.
A GRAPH FOR QUALITATIVE (NOMINAL) DATA
• The equal segments along the horizontal axis are allocated to the different words or
classes that appear in the frequency distribution for qualitative data.
• Likewise, equal segments along the vertical axis reflect increases in frequency.
• The body of the bar graph consists of a series of bars whose heights reflect the
frequencies for the various words or classes.
• A person’s answer to the question “Do you have a Facebook profile?” is either Yes or
No, not some impossible intermediate value, such as 40 percent Yes and 60 percent
No.
MISLEADING GRAPHS
• Graphs can be constructed in an unscrupulous manner to support a particular point
of view.
• For example, to imply that comparatively many students responded Yes to the
Facebook profile question, an unscrupulous person might resort to the following tricks.
• The width of the Yes bar is more than three times that of the No bar, thus violating the
custom that bars be equal in width.
• The lower end of the frequency scale is omitted, thus violating the custom that the
entire scale be reproduced, beginning with zero.
• The height of the vertical axis is several times the width of the horizontal axis, thus
violating the custom, heretofore unmentioned, that the vertical axis be approximately
as tall as the horizontal axis is wide.
Problem:
IV. DESCRIBING DATA WITH AVERAGES:
MODE
• The mode reflects the value of the most frequently occurring score.
Progress Check *3.1 Determine the mode for the following retirement ages: 60, 63, 45,
63, 65, 70, 55, 63, 60, 65, 63.
mode = 63
Progress Check *3.2 The owner of a new car conducts six gas mileage tests and obtains
the following results, expressed in miles per gallon: 26.3, 28.7, 27.4, 26.6, 27.4, 26.9. Find
the mode for these data.
mode = 27.4
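The two progress checks can be verified with Python's standard library. This sketch uses `statistics.multimode`, which returns a list so that a distribution with more than one mode is also handled.

```python
from statistics import multimode

retirement_ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]
gas_mileage = [26.3, 28.7, 27.4, 26.6, 27.4, 26.9]

# multimode returns every value tied for the highest frequency.
print(multimode(retirement_ages))  # [63]
print(multimode(gas_mileage))      # [27.4]
```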
MEDIAN
• The median reflects the middle value when observations are ordered from least to
most.
• The median splits a set of ordered observations into two equal parts, the upper and
lower halves.
• In other words, the median has a percentile rank of 50, since observations with
equal or smaller values constitute 50 percent of the entire distribution.
• To find the median, scores always must be ordered from least to most.
• When the total number of scores is odd, as in the lower left-hand panel of Table 3.2,
there is a single middle-ranked score, and the value of the median equals the value of
this score.
• When the total number of scores is even, as in the lower right-hand panel of Table 3.2,
the value of the median equals a value midway between the values of the two
middlemost scores.
• In either case, the value of the median always reflects the value of middle-ranked
scores, not the position of these scores among the set of ordered scores.
• The median term can be found for the 20 presidents.
Problems:
Progress Check *3.3 Find the median for the following retirement ages: 60, 63, 45, 63,
65, 70, 55, 63, 60, 65, 63.
median = 63
Progress Check *3.4 Find the median for the following gas mileage tests: 26.3, 28.7, 27.4,
26.6, 27.4, 26.9.
median = 27.15
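The odd and even cases can be sketched with a short function. The name `median` here is just an illustrative helper; the standard library's `statistics.median` does the same job.

```python
def median(scores):
    """Middle value of the ordered scores (midpoint of the two middle values if even)."""
    ordered = sorted(scores)  # scores must be ordered from least to most
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:            # odd: a single middle-ranked score
        return ordered[middle]
    # even: midway between the two middlemost scores
    return (ordered[middle - 1] + ordered[middle]) / 2

retirement_ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]
gas_mileage = [26.3, 28.7, 27.4, 26.6, 27.4, 26.9]

print(median(retirement_ages))        # 63
print(round(median(gas_mileage), 2))  # 27.15
```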
MEAN
• The mean is the most common average, one you have probably calculated many times.
• The mean is found by adding all scores and then dividing by the number of scores.
• Mean = (sum of all scores) / (number of scores)
• To find the mean term for the 20 presidents, add all 20 terms in Table 3.1 (4 + . . .
+ 4 + 8) to obtain a sum of 112 years, and then divide this sum by 20, the number
of presidents, to obtain a mean of 5.60 years.
Sample or Population?
• Statisticians distinguish between two types of means—the population mean and the
sample mean—depending on whether the data are viewed as a population (a
complete set of scores) or as a sample (a subset of scores).
Problems:
Progress Check *3.5 Find the mean for the following retirement ages: 60, 63, 45, 63, 65,
70, 55, 63, 60, 65, 63.
Progress Check *3.6 Find the mean for the following gas mileage tests: 26.3, 28.7, 27.4,
26.6, 27.4, 26.9.
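A minimal sketch for checking these two progress checks is shown below; `mean` is an illustrative helper that adds the scores and divides by their number, as described above.

```python
def mean(scores):
    """Sum of all scores divided by the number of scores."""
    return sum(scores) / len(scores)

retirement_ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]
gas_mileage = [26.3, 28.7, 27.4, 26.6, 27.4, 26.9]

print(round(mean(retirement_ages), 2))  # 61.09  (672 / 11)
print(round(mean(gas_mileage), 2))      # 27.22  (163.3 / 6)
```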
WHICH AVERAGE?
If Distribution Is Not Skewed
• When a distribution of scores is not too skewed, the values of the mode, median, and
mean are similar, and any of them can be used to describe the central tendency of the
distribution.
If Distribution Is Skewed
• When extreme scores cause a distribution to be skewed, as for the infant death rates
for selected countries listed in Table 3.4, the values of the three averages can differ.
• The mean is the single most preferred average for quantitative data.
• An average can refer to the mode, median, or mean, or even to the geometric mean or
the harmonic mean.
• Conventional usage prescribes that average usually signifies mean, and this
connotation is often reinforced by the context.
• For instance, grade point average is virtually synonymous with mean grade point.
• But when the data are qualitative, your choice among averages is restricted.
• The mode always can be used with qualitative data.
Inappropriate Averages
• It would not be appropriate to report a median for unordered qualitative data with
nominal measurement, such as the ancestries of Americans.
Problem:
V. DESCRIBING VARIABILITY:
• In Figure 4.1, each of the three frequency distributions consists of seven scores with
the same mean (10) but with different variabilities.
• Before reading on, rank the three distributions from least to most variable.
• Distribution A has the least variability, distribution B has intermediate variability,
and distribution C has the most variability.
• For distribution A with the least (zero) variability, all seven scores have the same value
(10).
• For distribution B with intermediate variability, the values of scores vary slightly (one
9 and one 11), and for distribution C with most variability, they vary even more (one
7, two 9s, two 11s, and one 13).
Importance of Variability
• Variability assumes a key role in an analysis of research results.
• Eg: A researcher might ask: Does fitness training improve, on average, the scores of
depressed patients on a mental-wellness test?
• To answer this question, depressed patients are randomly assigned to two groups,
fitness training is given to one group, and wellness scores are obtained for both
groups.
• Figure 4.2 shows the outcomes for two fictitious experiments, each with the same
mean difference of 2, but with the two groups in experiment B having less variability
than the two groups in experiment C.
• Notice that groups B and C in Figure 4.2 are the same as their counterparts in Figure
4.1.
• Although the new group B* retains exactly the same (intermediate) variability as
group B, each of its seven scores and its mean have been shifted 2 units to the right.
• Likewise, although the new group C* retains exactly the same (most) variability as group
C, each of its seven scores and its mean have been shifted 2 units to the right.
• Consequently, the crucial mean difference of 2 (from 12 − 10 = 2) is the same for both
experiments.
RANGE
• The range is the difference between the largest and smallest scores.
• In Figure 4.1, distribution A, the least variable, has the smallest range of 0 (from 10 to
10); distribution B, the moderately variable, has an intermediate range of 2 (from 11
to 9); and distribution C, the most variable, has the largest range of 6 (from 13 to 7),
in agreement with our intuitive judgments about differences in variability.
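The three ranges can be reproduced with the sketch below. The score lists are reconstructed from the verbal description of Figure 4.1 (seven scores each, mean of 10), so the exact makeup of distribution B (five 10s plus one 9 and one 11) is an assumption.

```python
# Scores reconstructed from the description of Figure 4.1 (an assumption).
distributions = {
    "A": [10, 10, 10, 10, 10, 10, 10],
    "B": [9, 10, 10, 10, 10, 10, 11],
    "C": [7, 9, 9, 10, 11, 11, 13],
}

def score_range(scores):
    """Difference between the largest and smallest scores."""
    return max(scores) - min(scores)

for name, scores in distributions.items():
    print(name, "mean =", sum(scores) / len(scores),
          "range =", score_range(scores))
```

All three means equal 10, while the ranges come out as 0, 2, and 6.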
Disadvantages of Range
• The range has several shortcomings.
• First, since its value depends on only two scores—the largest and the smallest—it
fails to use the information provided by the remaining scores.
• Second, the value of the range tends to increase with increases in the total number of
scores.
VARIANCE
• Variance and standard deviation are two important measures of variability in
statistics.
• Variance is a measure of how data points vary from the mean.
• The standard deviation is the square root of the variance and describes how widely
scores are spread around the mean.
• For each of the three distributions in Figure 4.1, the face values of the seven original
scores have been re-expressed as deviation scores from their mean of 10.
• For example, in distribution C, one score coincides with the mean of 10, four scores
(two 9s and two 11s) deviate 1 unit from the mean, and two scores (one 7 and one
13) deviate 3 units from the mean, yielding a set of seven deviation scores: one 0,
two –1s, two 1s, one –3, and one 3.
• The sum of all negative deviations always counterbalances the sum of all positive
deviations, regardless of the amount of variability in the group.
• A measure of variability, known as the mean absolute deviation (or m.a.d.), can be
salvaged by summing all absolute deviations from the mean, that is, by ignoring
negative signs, and then dividing this sum by the number of scores.
• Before calculating the variance (a type of mean), negative signs must be eliminated
from deviation scores. Squaring each deviation generates a set of squared deviation
scores, all of which are positive.
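The steps above can be traced for distribution C with a short sketch. The scores are reconstructed from the description in these notes, and the variance is computed as the mean of the squared deviations (dividing by the number of scores), which is an assumption about the formula intended at this point.

```python
scores_c = [7, 9, 9, 10, 11, 11, 13]  # distribution C, as described above
n = len(scores_c)
mean_c = sum(scores_c) / n            # 10.0

# Deviation scores: -3, -1, -1, 0, 1, 1, 3
deviations = [x - mean_c for x in scores_c]
print(sum(deviations))                # negatives and positives cancel

# Mean absolute deviation: ignore signs, then average.
mad = sum(abs(d) for d in deviations) / n

# Variance: square each deviation, then average the squares.
squared = [d ** 2 for d in deviations]
variance = sum(squared) / n

print(round(mad, 2), round(variance, 2))
```

For these scores the squared deviations sum to 22, so the variance is 22 / 7.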
STANDARD DEVIATION
• The standard deviation is the square root of the mean of all squared deviations from the
mean, that is, standard deviation = √variance.
• The standard deviation is a rough measure of the average amount by which scores
deviate on either side of their mean.
Majority of Scores within One Standard Deviation
• For most frequency distributions, a majority of all scores are within one standard
deviation on either side of the mean.
• In Figure 4.3, the lowercase letter s represents the standard deviation.
• As suggested in the top panel of Figure 4.3, if the distribution of IQ scores for a class
of fourth graders has a mean (X̄) of 105 and a standard deviation (s) of 15, a majority
of their IQ scores should be within one standard deviation on either side of the mean,
that is, between 90 and 120.
• For most frequency distributions, a small minority of all scores deviate more than two
standard deviations on either side of the mean.
• For instance, among the seven deviations in distribution C, none deviates more than
two standard deviations (2 × 1.77 = 3.54) on either side of the mean.
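As a quick check of these two rules of thumb, the sketch below computes the standard deviation of distribution C (scores reconstructed from the description earlier in these notes) and counts how many scores fall within one, and beyond two, standard deviations of the mean.

```python
import math

scores_c = [7, 9, 9, 10, 11, 11, 13]  # distribution C
n = len(scores_c)
mean_c = sum(scores_c) / n

# Square root of the mean of all squared deviations from the mean.
sd = math.sqrt(sum((x - mean_c) ** 2 for x in scores_c) / n)
print(round(sd, 2))                   # 1.77

within_one = [x for x in scores_c if abs(x - mean_c) <= sd]
beyond_two = [x for x in scores_c if abs(x - mean_c) > 2 * sd]

print(len(within_one), "of", n, "scores within one standard deviation")
print(len(beyond_two), "of", n, "scores beyond two standard deviations")
```

Five of the seven scores lie within one standard deviation, and none lies beyond two.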
4.3 (a) False. Relatively few students will score exactly one standard deviation from the
mean.
(b) False. Students will score both within and beyond one standard deviation from the
mean.
(c) True
(d) True
(e) False. See (b).
(f) True
STANDARD DEVIATION
Sum of Squares (SS)
• Calculating the standard deviation requires that we obtain first a value for the
variance.
• However, calculating the variance requires, in turn, that we obtain the sum of the
squared deviation scores.
• The sum of squared deviation scores, symbolized by SS, merits special attention
because it’s a major component in calculations for the variance, as well as many other
statistical measures.
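Sum of Squares Formulas for Population
• Definition formula: SS = Σ(X − μ)², the sum of the squared deviations of all scores
from the population mean.
• Computation formula: SS = ΣX² − (ΣX)²/N, where N is the population size.
Standard Deviation for Population σ
• Variance: σ² = SS/N.
• Standard deviation: σ = √(SS/N).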
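The sketch below computes SS in two standard ways and then the population standard deviation. It assumes the usual definition form, SS = Σ(X − μ)², and the equivalent computation form, SS = ΣX² − (ΣX)²/N, and treats the seven scores of distribution C as a complete population.

```python
import math

scores = [7, 9, 9, 10, 11, 11, 13]  # distribution C, treated as a population
N = len(scores)
mu = sum(scores) / N

# Definition form: sum of squared deviations from the mean.
ss_definition = sum((x - mu) ** 2 for x in scores)

# Computation form: avoids calculating each deviation.
ss_computation = sum(x ** 2 for x in scores) - sum(scores) ** 2 / N

variance = ss_definition / N        # population variance
sigma = math.sqrt(variance)         # population standard deviation

print(ss_definition, ss_computation, round(sigma, 2))
```

Both forms should agree (here SS is 22), which makes the computation form a handy check on hand calculations.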
If μ Is Unknown
• It would be most efficient if, as above, we could use a random sample of n deviations
expressed around the population mean, X − μ, to estimate variability in the population.
• But this is usually impossible because, in fact, the population mean is unknown.
• Therefore, we must substitute the known sample mean, X̄, for the unknown
population mean, μ, and we must use a random sample of n deviations expressed
around their own sample mean, X − X̄, to estimate variability in the population.
• For example, although a sample of n = 5 scores has 5 deviations, only n − 1 = 4 of these
deviations are free to vary because the sum of the n = 5 deviations from their own sample
mean always equals zero.
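The n − 1 idea can be illustrated with a made-up sample of five scores (the values are hypothetical). The sketch shows that the last deviation is fixed by the other four, and then divides the sum of squares by n − 1, which is the convention `statistics.stdev` also follows.

```python
import math
import statistics

sample = [7, 3, 1, 0, 4]            # hypothetical sample, n = 5
n = len(sample)
sample_mean = sum(sample) / n       # 3.0

deviations = [x - sample_mean for x in sample]  # 4, 0, -2, -3, 1

# Only n - 1 deviations are free to vary: the last one is forced
# to be whatever makes the total equal zero.
implied_last = -sum(deviations[:-1])
print(implied_last, deviations[-1])

ss = sum(d ** 2 for d in deviations)  # sum of squares around the sample mean
s = math.sqrt(ss / (n - 1))           # sample standard deviation

print(round(s, 2), round(statistics.stdev(sample), 2))
```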
In Figure 5.2, the idealized normal curve has been superimposed on the original
distribution for 3091 men.
Interpreting the Shaded Area
• The total area under the normal curve in Figure 5.2 can be identified with all FBI
applicants.
• Viewed relative to the total area, the shaded area represents the proportion of
applicants who will be eligible because they are shorter than exactly 66 inches.
• Every normal curve can be interpreted in exactly the same way once any distance from
the mean is expressed in standard deviation units.
• For example, .68, or 68 percent of the total area under a normal curve (any normal
curve) is within one standard deviation above and below the mean, and only .05, or
5 percent, of the total area is more than two standard deviations above and below the
mean.
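These proportions can be checked numerically with `statistics.NormalDist` (Python 3.8 or later). In this sketch the first curve uses the mean of 69 and standard deviation of 3 quoted for heights later in these notes; the second curve is a hypothetical one added only to show that the proportions do not depend on the particular mean and standard deviation.

```python
from statistics import NormalDist

def area_within(curve, k):
    """Proportion of area within k standard deviations of the curve's mean."""
    lo = curve.mean - k * curve.stdev
    hi = curve.mean + k * curve.stdev
    return curve.cdf(hi) - curve.cdf(lo)

heights = NormalDist(mu=69, sigma=3)   # heights in inches
other = NormalDist(mu=500, sigma=100)  # hypothetical second normal curve

for curve in (heights, other):
    within_one = area_within(curve, 1)
    beyond_two = 1 - area_within(curve, 2)
    print(round(within_one, 2), round(beyond_two, 2))  # 0.68 0.05
```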
z SCORES
• A z score is a unit-free, standardized score that, regardless of the original units of
measurement, indicates how many standard deviations a score is above or below the
mean of its distribution.
• To obtain a z score, express any original score, whether measured in inches,
milliseconds, dollars, IQ points, etc., as a deviation from its mean, and then divide
this deviation by the standard deviation:
z = (X − μ) / σ
• where X is the original score and μ and σ are the mean and the standard deviation,
respectively, for the normal distribution of the original scores.
• Since identical units of measurement appear in both the numerator and denominator
of the ratio for z, the original units of measurement cancel each other and the z score
emerges as a unit-free or standardized number, often referred to as a standard score.
A z score consists of two parts:
1. a positive or negative sign indicating whether it’s above or below the mean; and
2. a number indicating the size of its deviation from the mean in standard deviation
units.
• A z score of 2.00 always signifies that the original score is exactly two standard
deviations above its mean.
• Similarly, a z score of –1.27 signifies that the original score is exactly 1.27 standard
deviations below its mean.
• A z score of 0 signifies that the original score coincides with the mean.
Converting to z Scores
• To answer the question about eligible FBI applicants, replace X with 66 (the maximum
permissible height), μ with 69 (the mean height), and σ with 3 (the standard deviation
of heights) and solve for z as follows:
z = (66 − 69) / 3 = −3 / 3 = −1.00
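A minimal sketch of this conversion, using the usual definition z = (X − μ) / σ, is given below. The helper name `z_score` is illustrative, and the final step assumes, as the notes do, that heights approximately follow a normal curve.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma

z = z_score(66, mu=69, sigma=3)
print(z)  # -1.0

# Proportion of the standard normal curve below z, i.e. the shaded area
# for applicants shorter than 66 inches (under the normality assumption).
proportion_shorter = NormalDist().cdf(z)
print(round(proportion_shorter, 4))
```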
STANDARD NORMAL CURVE
• If the original distribution approximates a normal curve, then the shift to standard or z
scores will always produce a new distribution that approximates the standard normal
curve.
• The standard normal curve always has a mean of 0 and a standard deviation of 1.
To verify that the mean of a standard normal distribution equals 0, replace
X in the z score formula with μ, the mean of any normal distribution, and then solve
for z:
z = (μ − μ) / σ = 0 / σ = 0
• Likewise, to verify that the standard deviation of the standard normal distribution
equals 1, replace X in the z score formula with μ + 1σ, the value corresponding to one
standard deviation above the mean for any (nonstandard) normal distribution, and
then solve for z:
z = (μ + 1σ − μ) / σ = σ / σ = 1
• Although there is an infinite number of different normal curves, each with its own mean
and standard deviation, there is only one standard normal curve, with a mean of 0 and
a standard deviation of 1.
• Converting all original observations into z scores leaves the normal shape intact but
not the units of measurement.
• Shaded observations of 66 inches, 1080 hours, and 90 IQ points all reappear as a z
score of –1.00.
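The claim that z scores always have a mean of 0 and a standard deviation of 1 holds for any set of scores, normal or not. The sketch below uses the seven scores of distribution C from earlier in these notes purely as convenient data.

```python
from statistics import mean, pstdev

scores = [7, 9, 9, 10, 11, 11, 13]
mu, sigma = mean(scores), pstdev(scores)  # population mean and SD

# Convert every original score to a z score.
z_scores = [(x - mu) / sigma for x in scores]

# Up to floating-point rounding, the new mean is 0 and the new SD is 1.
print(abs(mean(z_scores)) < 1e-9, abs(pstdev(z_scores) - 1) < 1e-9)
```

The relative spacing of the scores is unchanged; only the units and the scale differ.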