
Unit 2

The document provides an overview of statistics, focusing on descriptive and inferential statistics, types of data, and methods for organizing and summarizing data using tables and graphs. It outlines three types of data (qualitative, ranked, and quantitative) and discusses variables, including discrete and continuous variables, as well as independent and dependent variables. Additionally, it covers frequency distributions, outliers, and various graphical representations such as histograms and frequency polygons.

Uploaded by

ilayaraja.it
Copyright
© © All Rights Reserved

UNIT II

DESCRIBING DATA

Types of Data - Types of Variables - Describing Data with Tables and Graphs - Describing
Data with Averages - Describing Variability

WHAT IS STATISTICS?
Statistics exists because of the prevalence of variability in the real world.

Descriptive Statistics:
• In its simplest form, known as descriptive statistics, statistics provides us with
tools (tables, graphs, averages, ranges, correlations) for organizing and summarizing
the inevitable variability in collections of actual observations or scores.
• Eg: A tabular listing ranked from most to least; a graph showing the annual
change in global temperature during the last 30 years.

Inferential Statistics:
• Statistics also provides tools—a variety of tests and estimates—for generalizing
beyond collections of actual observations.
• This more advanced area is known as inferential statistics.
• Eg: An assertion about the relationship between job satisfaction and overall
happiness

I. THREE TYPES OF DATA:


• Data is a collection of actual observations or scores in a survey or an experiment.
• The precise form of statistical analysis often depends on whether data are
qualitative, ranked, or quantitative.

Data

Qualitative Ranked Quantitative


• Qualitative Data: Qualitative data consist of words (Yes or No), letters (Y or N), or
numerical codes (0 or 1) that represent a class or category.
• Ranked data consist of numbers (1st, 2nd, . . . 40th place) that represent relative
standing within a group.
• Quantitative data consist of numbers (weights of 238, 170, . . . 185 lbs) that
represent an amount or a count.

Quantitative Data

• The weights reported by 53 male students in Table 1.1 are quantitative data, since any
single observation, such as 160 lbs, represents an amount of weight.

Ranked Data

• Ranks from 1 to 15, assigned according to each observation's standing in the available
list, are ranked data.

Qualitative Data

• The Y and N replies of students in Table 1.2 are qualitative data, since any single
observation is a letter that represents a class of replies.

II. TYPES OF VARIABLES


• A variable is a characteristic or property that can take on different values.

Discrete and Continuous Variables


• Quantitative variables can be further distinguished in terms of whether they are
discrete or continuous.
• A discrete variable consists of isolated numbers separated by gaps.
• Examples include most counts, such as the number of children in a family; the number
of foreign countries you have visited; and the current size of the U.S. population.
• A continuous variable consists of numbers whose values, at least in theory, have no
restrictions.
• Examples include amounts, such as weights of male statistics students; durations; and
standardized test scores, such as those on the Scholastic Aptitude Test (SAT).

Approximate Numbers
• In theory, values for continuous variables can be carried out infinitely far.
• Eg: Someone’s weight, in pounds, might be 140.01438, and so on, to infinity!
• Practical considerations require that values for continuous variables be rounded off.
• Whenever values are rounded off, as is always the case with actual values for
continuous variables, the resulting numbers are approximate, never exact.
• For example, the weights of the male statistics students in Table 1.1 are rounded to
the nearest pound.
• A student whose weight is listed as 150 lbs could actually weigh between 149.5 and
150.5 lbs.
Independent and Dependent Variables
• Most studies raise questions about the presence or absence of a relationship
between two (or more) variables.
• Eg: A psychologist might wish to investigate whether couples who
undergo special training in “active listening” tend to have fewer communication
breakdowns than do couples who undergo no special training.
• An experiment is a study in which the investigator decides who receives the special
treatment.

Dependent Variable
• When a variable is believed to have been influenced by the independent variable, it is
called a dependent variable.
• In an experimental setting, the dependent variable is measured, counted, or
recorded by the investigator.
• Unlike the independent variable, the dependent variable isn’t manipulated by the
investigator.
• Instead, it represents an outcome: the data produced by the experiment.
• Eg: To test whether training influences communication, the psychologist counts the
number of communication breakdowns between each couple.

Observational Studies
• Instead of undertaking an experiment, an investigator might simply observe the
relation between two variables. For example, a sociologist might collect paired
measures of poverty level and crime rate for each individual in some group.
• Such studies are often referred to as observational studies.
• An observational study focuses on detecting relationships between variables not
manipulated by the investigator, and it yields less clear-cut conclusions about cause-
effect relationships than does an experiment.
Confounding Variable
• Whenever groups differ not just because of the independent variable but also because
some uncontrolled variable co-varies with the independent variable, any conclusion
about a cause-effect relationship is suspect.
• A difference between groups might be due not to the independent variable but to a
confounding variable.
• For instance, couples willing to devote extra effort to special training might already
possess a deeper commitment that co-varies with more active-listening skills.
• An uncontrolled variable that compromises the interpretation of a study is known as a
confounding variable.

Problems:
III. DESCRIBING DATA WITH TABLES AND GRAPHS:

TABLES (FREQUENCY DISTRIBUTIONS)

• A frequency distribution is a collection of observations produced by sorting
observations into classes and showing their frequency (f) of occurrence in each class.
• To organize the weights of the male statistics students listed in Table 1.1, first arrange
a column of consecutive numbers, beginning with the lightest weight (133) at the bottom
and ending with the heaviest weight (245) at the top.
• Place a short vertical stroke, or tally, next to a number each time its value appears in
the original set of data; once this process has been completed, substitute for each tally
count a number indicating the frequency (f) of occurrence of each weight.
• When observations are sorted into classes of single values, as in Table 2.1, the result
is referred to as a frequency distribution for ungrouped data.
• The frequency distribution shown in Table 2.1 is only partially displayed because there
are more than 100 possible values between the largest and smallest observations.
Grouped Data
• When observations are sorted into classes of more than one value, as in Table 2.2, the
result is referred to as a frequency distribution for grouped data.
• Data are grouped into class intervals with 10 possible values each.
• The bottom class includes the smallest observation (133), and the top class includes
the largest observation (245).
• The distance between bottom and top is occupied by an orderly series of classes.
• The frequency (f) column shows the frequency of observations in each class and, at
the bottom, the total number of observations in all classes.

Gaps between Classes:


• The size of the gap should always equal one unit of measurement.
• It should always equal the smallest possible difference between scores within a
particular set of data.
• Since the gap is never bigger than one unit of measurement, no score can fall into the
gap.

How Many Classes?


• The number of classes should be neither too small nor too large.

When There Are Either Many or Few Observations:


• About 10 classes is the recommended number, but this can be adjusted upward when
there are many observations and downward when there are few.

Real Limits of Class Intervals


• The real limits are located at the midpoint of the gap between adjacent tabled
boundaries; that is, one-half of one unit of measurement below the lower tabled
boundary and one-half of one unit of measurement above the upper tabled boundary.
• Eg: The real limits for 140–149 in Table 2.2 are 139.5 (140 minus one-half of the unit
of measurement of 1) and 149.5 (149 plus one-half of the unit of measurement of 1),
and the actual width of the class interval would be 10 (from 149.5 − 139.5 = 10).
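The arithmetic behind real limits can be sketched in a few lines of Python. The function name `real_limits` and its `unit` parameter are illustrative assumptions rather than anything defined in these notes, and the sketch assumes the same unit of measurement for every class.

```python
def real_limits(lower, upper, unit=1):
    """Return the real limits of a class with tabled boundaries lower-upper.

    `unit` is the unit of measurement (1 lb for the weight data).
    """
    half = unit / 2
    return lower - half, upper + half


low, high = real_limits(140, 149)
print(low, high)     # 139.5 149.5
print(high - low)    # 10.0, the actual width of the class interval
```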
GUIDELINES

Essential:
1. Each observation should be included in one, and only one, class.
Example: 130–139, 140–149, 150–159, etc.
2. List all classes, even those with zero frequencies.
Example: Listed in Table 2.2 is the class 210–219 and its frequency of zero.
3. All classes should have equal intervals.
Example: 130–139, 140–149, 150–159, etc. It would be incorrect to use 130–139, 140–159,
etc., in which the second class (140–159) is twice as wide as the first (130–139).

Optional:
4. All classes should have both an upper boundary and a lower boundary.
Example: 240–249. Less preferred would be 240–above, in which no maximum value can
be assigned to observations in this class.
5. Select the class interval from convenient numbers, such as 1, 2, 3, . . . 10, particularly 5
and 10 or multiples of 5 and 10.
Example: 130–139, 140–149, in which the class interval of 10 is a convenient number.
6. The lower boundary of each class interval should be a multiple of the class interval.
Example: 130–139, 140–149, in which the lower boundaries of 130, 140, are multiples of 10,
the class interval.
7. Aim for a total of approximately 10 classes.
Example: The distribution in Table 2.2 uses 12 classes.
CONSTRUCTING FREQUENCY DISTRIBUTIONS
1. Find the range
2. Find the class interval required to span the range by dividing the range by the desired
number of classes
3. Round off to the nearest convenient interval
4. Determine where the lowest class should begin.
5. Determine where the lowest class should end.
6. Working upward, list as many equivalent classes as are required to include the largest
observation.
7. Indicate with a tally the class in which each observation falls.
8. Replace the tally count for each class with a number, the frequency (f), and show the
total of all frequencies.
9. Supply headings for both columns and a title for the table.
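The nine steps above can be sketched in Python. This is a minimal illustration, not the method used to build Table 2.2: the weights are hypothetical (only the extremes 133 and 245 come from the notes), the list of "convenient" intervals is an assumption, and the scores are assumed to be whole numbers with a unit of measurement of 1.

```python
from collections import Counter

# Hypothetical weights in lbs. Only the extremes (133 and 245) come from the
# notes; the other values are made up for illustration.
weights = [133, 152, 158, 160, 165, 166, 172, 175, 180, 193, 204, 245]

# Candidate "convenient" class intervals (an assumption for this sketch).
CONVENIENT = (1, 2, 3, 4, 5, 10, 20, 25, 50, 100)


def grouped_frequencies(scores, desired_classes=10):
    """Build (lower, upper, frequency) rows for whole-number scores."""
    data_range = max(scores) - min(scores)                       # step 1
    raw_width = data_range / desired_classes                     # step 2
    width = min(CONVENIENT, key=lambda c: abs(c - raw_width))    # step 3
    lowest = (min(scores) // width) * width                      # step 4
    tallies = Counter((x - lowest) // width for x in scores)     # steps 7-8
    rows = []
    lower = lowest
    while lower <= max(scores):                                  # step 6
        upper = lower + width - 1                                # step 5
        rows.append((lower, upper, tallies[(lower - lowest) // width]))
        lower += width
    return rows


table = grouped_frequencies(weights)
print("Weight (lbs)   f")                                        # step 9: headings
for lower, upper, f in reversed(table):                          # heaviest class on top
    print(f"{lower}-{upper}        {f}")
print(f"Total          {sum(f for _, _, f in table)}")
```

With these made-up weights the range is 112, the raw interval 11.2 rounds to 10, and the table runs from 130–139 to 240–249, with empty classes still listed.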

Problems:
OUTLIERS
• One or more very extreme scores are called outliers.
• Ex: A GPA of 0.06, an IQ of 170, summer wages of $62,000

Check for Accuracy


• Whenever you encounter an outrageously extreme value, such as a GPA of 0.06,
attempt to verify its accuracy.
• If the outlier survives an accuracy check, it should be treated as a legitimate score.

Might Exclude from Summaries


• We might choose to segregate (but not to suppress!) an outlier from any summary of
the data.
• We might use various numerical summaries, such as the median and interquartile
range, that ignore extreme scores, including outliers.

Might Enhance Understanding


• Because a valid outlier can be viewed as the product of special circumstances, it might
help you to understand the data.
• Eg: crime rates differ among communities

Problem:
RELATIVE FREQUENCY DISTRIBUTIONS

• Relative frequency distributions show the frequency of each class as a part or
fraction of the total frequency for the entire distribution.
• This type of distribution allows us to focus on the relative concentration of
observations among different classes within the same distribution.
• In the case of the weight data in Table 2.2, it permits us to see that the 160s account for
about one-fourth (12/53 = 0.23, or 23%) of all observations.

Constructing Relative Frequency Distributions


• To convert a frequency distribution into a relative frequency distribution, divide the
frequency for each class by the total frequency for the entire distribution.

Percentages or Proportions?
• A proportion always varies between 0 and 1, whereas a percentage always varies
between 0 percent and 100 percent.
• To convert the relative frequencies in Table 2.5 from proportions to percentages,
multiply each proportion by 100; that is, move the decimal point two places to the right.
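The conversion to proportions and percentages can be sketched as follows. Only three facts in this example come from the notes (12 observations in the 160s, none in 210–219, and a total of 53); every other frequency is a placeholder chosen so that the total works out, and the variable names are illustrative.

```python
# Class frequencies, lowest class first. Only the 160-169 count (12), the
# empty 210-219 class, and the total of 53 are stated in the notes; the
# remaining frequencies are placeholders.
frequencies = {
    "130-139": 3, "140-149": 1, "150-159": 17, "160-169": 12,
    "170-179": 7, "180-189": 3, "190-199": 4, "200-209": 2,
    "210-219": 0, "220-229": 3, "230-239": 0, "240-249": 1,
}

total = sum(frequencies.values())                              # 53
proportions = {c: f / total for c, f in frequencies.items()}   # divide by total
percentages = {c: 100 * p for c, p in proportions.items()}     # multiply by 100

print(round(proportions["160-169"], 2))   # 0.23
print(round(percentages["160-169"]))      # 23
```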
Problem:

CUMULATIVE FREQUENCY DISTRIBUTIONS


• Cumulative frequency distributions show the total number of observations in each class
and in all lower-ranked classes.
• This type of distribution can be used effectively with sets of scores, such as test scores
for intellectual or academic aptitude.
• Under these circumstances, cumulative frequencies are usually converted, in turn, to
cumulative percentages. Cumulative percentages are often referred to as percentile
ranks.

Constructing Cumulative Frequency Distributions


• To convert a frequency distribution into a cumulative frequency distribution, add to
the frequency of each class the sum of the frequencies of all classes ranked below it.
• This gives the cumulative frequency for that class.
• Begin with the lowest-ranked class in the frequency distribution and work upward,
finding the cumulative frequencies in ascending order.
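Working upward from the lowest-ranked class amounts to a running total, which could be sketched as below. The four classes and their frequencies are placeholders, not values taken from Table 2.2, and the conversion to cumulative percentages is included only to show the follow-on step.

```python
from itertools import accumulate

# Placeholder frequencies for four classes, lowest-ranked class first.
classes = ["130-139", "140-149", "150-159", "160-169"]
frequencies = [3, 1, 17, 12]

cumulative = list(accumulate(frequencies))      # running totals, working upward
total = cumulative[-1]
cumulative_percent = [100 * c / total for c in cumulative]

for name, f, cf, cp in zip(classes, frequencies, cumulative, cumulative_percent):
    print(f"{name}  f={f:2d}  cum f={cf:2d}  cum %={cp:5.1f}")
```

The top class always has a cumulative frequency equal to the total, so its cumulative percent is 100.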

Cumulative Percentages
• If relative standing within a distribution is particularly important, then cumulative
frequencies are converted to cumulative percentages.
Percentile Ranks
• When used to describe the relative position of any score within its parent
distribution, cumulative percentages are referred to as percentile ranks.
• The percentile rank of a score indicates the percentage of scores in the entire
distribution with similar or smaller values than that score.
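For ungrouped data, this definition of a percentile rank translates directly into code. The sketch below uses hypothetical test scores, and the function name `percentile_rank` is an illustrative choice.

```python
def percentile_rank(score, scores):
    """Percentage of scores in the distribution with equal or smaller values."""
    at_or_below = sum(1 for x in scores if x <= score)
    return 100 * at_or_below / len(scores)


# Hypothetical test scores (not from the notes).
test_scores = [52, 61, 61, 70, 75, 75, 75, 82, 90, 98]

print(percentile_rank(75, test_scores))   # 70.0
print(percentile_rank(98, test_scores))   # 100.0
```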

Approximate Percentile Ranks (from Grouped Data)


• The assignment of exact percentile ranks requires that cumulative percentages be
obtained from frequency distributions for ungrouped data.
• If we have access only to a frequency distribution for grouped data, cumulative
percentages can be used to assign approximate percentile ranks.

Problem:
FREQUENCY DISTRIBUTIONS FOR QUALITATIVE (NOMINAL) DATA
• When, among a set of observations, any single observation is a word, letter, or
numerical code, the data are qualitative.
• Determine the frequency with which observations occupy each class, and report
these frequencies.
• This frequency distribution reveals that Yes replies are approximately twice as
prevalent as No replies.

Ordered Qualitative Data


• It is arbitrary whether Yes is listed above or below No in Table 2.7.
• When, however, qualitative data have an ordinal level of measurement because
observations can be ordered from least to most, that order should be preserved in the
frequency table.

Relative and Cumulative Distributions for Qualitative Data


• Frequency distributions for qualitative variables can always be converted into
relative frequency distributions.
• When observations can be ordered, cumulative percentages can also be used. Eg: A
captain has an approximate percentile rank of 63 among officers, since 62.5 (or 63) is
the cumulative percent for this class.
Problem:

INTERPRETING DISTRIBUTIONS CONSTRUCTED BY OTHERS

• When inspecting a distribution for the first time, train yourself to look at the entire
table, not just the distribution.
• Read the title, column headings, and any footnotes.
• Where do the data come from? Is a source cited? Next, focus on the form of the
frequency distribution.
• Apply the same checks when interpreting any distribution, including distributions
constructed by someone else.
GRAPHS

• Data can be described clearly and concisely with the aid of a well-constructed
frequency distribution.
GRAPHS FOR QUANTITATIVE DATA

Histograms

Important features of histograms:


• Equal units along the horizontal axis (the X axis, or abscissa) reflect the various class
intervals of the frequency distribution.
• Equal units along the vertical axis (the Y axis, or ordinate) reflect increases in
frequency.
• The intersection of the two axes defines the origin at which both numerical scales
equal 0.
• Numerical scales always increase from left to right along the horizontal axis and
from bottom to top along the vertical axis.
• The body of the histogram consists of a series of bars whose heights reflect the
frequencies for the various classes.
• The adjacent bars in histograms have common boundaries that emphasize the
continuity of quantitative data for continuous variables.

Frequency Polygon
• An important variation on a histogram is the frequency polygon, or line graph.
• Frequency polygons may be constructed directly from frequency distributions.
• However, we will follow the step-by-step transformation of a histogram into a
frequency polygon.

A. This panel shows the histogram for the weight distribution.


B. Place dots at the midpoints of each bar top or, in the absence of bar tops, at midpoints for
classes on the horizontal axis, and connect them with straight lines.
C. Anchor the frequency polygon to the horizontal axis.
D. Finally, erase all of the histogram bars, leaving only the frequency polygon.
Stem and Leaf Displays

• Stem and leaf displays are ideal for summarizing distributions, such as that for
weight data, without destroying the identities of individual observations.

Constructing a Display
• The leftmost panel of Table 2.9 re-creates the weights of the 53 male statistics
students listed in Table 1.1.
• To construct the stem and leaf display for these data, first note that, when counting by
tens, the weights range from the 130s to the 240s.
• Arrange a column of numbers, the stems, beginning with 13 (representing the
130s) and ending with 24 (representing the 240s).
• Draw a vertical line to separate the stems, which represent multiples of 10, from the
space to be occupied by the leaves, which represent multiples of 1.
• Next, enter each raw score into the stem and leaf display.
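The construction just described can be sketched in Python. The weights below are hypothetical stand-ins for Table 1.1 (only 133 and 245 are mentioned in the notes), and the sketch assumes whole-number scores with stems as tens and leaves as units.

```python
from collections import defaultdict

# Hypothetical weights in lbs; only 133 and 245 are mentioned in the notes.
weights = [160, 133, 166, 152, 245, 165, 158, 172]


def stem_and_leaf(scores):
    """Return display lines with stems as tens and leaves as units."""
    leaves_by_stem = defaultdict(list)
    for score in scores:                       # enter each raw score in turn
        stem, leaf = divmod(score, 10)
        leaves_by_stem[stem].append(leaf)
    lines = []
    # List every stem from lowest to highest, including stems with no leaves.
    for stem in range(min(leaves_by_stem), max(leaves_by_stem) + 1):
        leaves = "".join(str(leaf) for leaf in leaves_by_stem.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines


display = stem_and_leaf(weights)
print("\n".join(display))
```

Leaves appear in the order the raw scores were entered, so the individual observations remain recoverable from the display.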
Interpretation
• The weight data have been sorted by the stems. All weights in the 130s are listed
together; all of those in the 140s are listed together, and so on.
• A glance at the stem and leaf display in Table 2.9 shows essentially the same pattern
of weights depicted by the frequency distribution in Table 2.2 and the histogram.
Selection of Stems
• Stem values are not limited to units of 10.
• Depending on the data, you might identify the stem with one or more leading digits that
culminates in some variation on a stem value of 10, such as 1, 100, 1000, or even .1,
.01, .001, and so on.
• Stem and leaf displays represent statistical bargains: they summarize a distribution
while preserving the identities of individual observations.

Problem:

TYPICAL SHAPES
• Whether expressed as a histogram, a frequency polygon, or a stem and leaf display,
an important characteristic of a frequency distribution is its shape.
Normal
• Any distribution that approximates the normal shape in panel A of Figure 2.3 can be
analyzed with the aid of the normal curve.
• The familiar bell-shaped silhouette of the normal curve can be superimposed on many
frequency distributions. Eg: uninterrupted gestation periods of human fetuses, scores
on standardized tests, and even the popping times of individual kernels in a batch of
popcorn.

Bimodal
• Any distribution that approximates the bimodal shape in panel B of Figure 2.3 reflects
the coexistence of two different types of observations in the same distribution.
• Eg: The distribution of the ages of residents in a neighborhood consisting largely of
either new parents or their infants has a bimodal shape.

Positively Skewed
• The two remaining shapes in Figure 2.3 are lopsided.
• A lopsided distribution caused by a few extreme observations in the positive
direction, as in panel C of Figure 2.3, is a positively skewed distribution.
• Eg: most family incomes under $200,000 and relatively few family incomes spanning
a wide range of values above $200,000.

Negatively Skewed
• A lopsided distribution caused by a few extreme observations in the negative
direction, as in panel D of Figure 2.3, is a negatively skewed distribution.
• Eg: Most retirement ages at 60 years or older and relatively few retirement ages
spanning the wide range of ages younger than 60.
A GRAPH FOR QUALITATIVE (NOMINAL) DATA

• The equal segments along the horizontal axis are allocated to the different words or
classes that appear in the frequency distribution for qualitative data.
• Likewise, equal segments along the vertical axis reflect increases in frequency.
• The body of the bar graph consists of a series of bars whose heights reflect the
frequencies for the various words or classes.
• A person’s answer to the question “Do you have a Facebook profile?” is either Yes or
No, not some impossible intermediate value, such as 40 percent Yes and 60 percent
No.
MISLEADING GRAPHS
• Graphs can be constructed in an unscrupulous manner to support a particular point
of view.
• For example, to imply that comparatively many students responded Yes to the
Facebook profile question, an unscrupulous person might resort to various tricks, such
as the following.
• The width of the Yes bar is more than three times that of the No bar, thus violating the
custom that bars be equal in width.
• The lower end of the frequency scale is omitted, thus violating the custom that the
entire scale be reproduced, beginning with zero.
• The height of the vertical axis is several times the width of the horizontal axis, thus
violating the custom, heretofore unmentioned, that the vertical axis be approximately
as tall as the horizontal axis is wide.
Problem:
IV. DESCRIBING DATA WITH AVERAGES:

MODE
• The mode reflects the value of the most frequently occurring score.

• Distributions can have more than one mode.


• Distributions with two obvious peaks, even though they are not exactly the same
height, are referred to as bimodal.
• Distributions with more than two peaks are referred to as multimodal.
• The presence of more than one mode might reflect important differences among
subsets of data.
Problems:

Progress Check *3.1 Determine the mode for the following retirement ages: 60, 63, 45,
63, 65, 70, 55, 63, 60, 65, 63.
mode = 63

Progress Check *3.2 The owner of a new car conducts six gas mileage tests and obtains
the following results, expressed in miles per gallon: 26.3, 28.7, 27.4, 26.6, 27.4, 26.9. Find
the mode for these data.

mode = 27.4
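The two answers above can be checked with Python's standard `statistics` module. `multimode` returns every value tied for the highest frequency, so it also reports more than one mode when there is an exact tie; it will not flag the looser kind of bimodality described earlier, where two peaks differ in height. The third list is made up for illustration.

```python
from statistics import multimode

ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]      # Progress Check 3.1
mileage = [26.3, 28.7, 27.4, 26.6, 27.4, 26.9]           # Progress Check 3.2

print(multimode(ages))              # [63]
print(multimode(mileage))           # [27.4]

# Two values tied for the highest frequency give two modes (made-up scores).
print(multimode([2, 2, 5, 7, 7]))   # [2, 7]
```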

MEDIAN
• The median reflects the middle value when observations are ordered from least to
most.
• The median splits a set of ordered observations into two equal parts, the upper and
lower halves.
• In other words, the median has a percentile rank of 50, since observations with
equal or smaller values constitute 50 percent of the entire distribution.

Finding the Median

• To find the median, scores always must be ordered from least to most.
• When the total number of scores is odd, as in the lower left-hand panel of Table 3.2,
there is a single middle-ranked score, and the value of the median equals the value of
this score.
• When the total number of scores is even, as in the lower right-hand panel of Table 3.2,
the value of the median equals a value midway between the values of the two
middlemost scores.
• In either case, the value of the median always reflects the value of middle-ranked
scores, not the position of these scores among the set of ordered scores.
• The median term can be found for the 20 presidents.

Problems:

Progress Check *3.3 Find the median for the following retirement ages: 60, 63, 45, 63,
65, 70, 55, 63, 60, 65, 63.
median = 63

Progress Check *3.4 Find the median for the following gas mileage tests: 26.3, 28.7, 27.4,
26.6, 27.4, 26.9.

median = 27.15
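The odd and even cases described under "Finding the Median" can be sketched as a short function and checked against the two answers above. The function is written out by hand to show the steps; the name `median` here is local to this sketch.

```python
def median(scores):
    """Order the scores, then take the middle value (or the midpoint of two)."""
    ordered = sorted(scores)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                       # odd: a single middle-ranked score
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # even: midway between two


ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]      # Progress Check 3.3
mileage = [26.3, 28.7, 27.4, 26.6, 27.4, 26.9]           # Progress Check 3.4

print(median(ages))                  # 63
print(round(median(mileage), 2))     # 27.15
```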

MEAN
• The mean is the most common average, one you have probably calculated many times.
• The mean is found by adding all scores and then dividing by the number of scores.


• To find the mean term for the 20 presidents, add all 20 terms in Table 3.1 (4 + . . .
+ 4 + 8) to obtain a sum of 112 years, and then divide this sum by 20, the number
of presidents, to obtain a mean of 5.60 years.
Sample or Population?
• Statisticians distinguish between two types of means—the population mean and the
sample mean—depending on whether the data are viewed as a population (a
complete set of scores) or as a sample (a subset of scores).

Formula for Sample Mean


• When symbols are used, X̄ designates the sample mean, and the formula becomes
$\bar{X} = \frac{\sum X}{n}$ and
reads: “X-bar equals the sum of the variable X divided by the sample size n.”

Formula for Population Mean


• The formula for the population mean differs from that for the sample mean only
because of a change in some symbols.
• The population mean is represented by μ (pronounced “mu”), the lowercase Greek
letter m for mean, and is given by $\mu = \frac{\sum X}{N}$, where the uppercase letter N
refers to the population size.
• Otherwise, the calculations are the same as those for the sample mean.

Mean as Balance Point


• The mean serves as the balance point for its frequency distribution because of a special
property:
• The sum of all scores, expressed as positive and negative deviations from the
mean, always equals zero.
• In its role as balance point, the mean describes the single point of equilibrium at
which, once all scores have been expressed as deviations from the mean, those above
the mean exactly counterbalance those below the mean.
• The mean reflects the values of all scores, not just those that are middle ranked (as
with the median), or those that occur most frequently (as with the mode).

Problems:
Progress Check *3.5 Find the mean for the following retirement ages: 60, 63, 45, 63, 65,
70, 55, 63, 60, 65, 63.

Progress Check *3.6 Find the mean for the following gas mileage tests: 26.3, 28.7, 27.4,
26.6, 27.4, 26.9.
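Both progress checks can be worked through with a few lines of Python, which also illustrates the balance-point property. The helper name `mean` is local to this sketch, and the sums in the comments are plain arithmetic on the listed scores.

```python
def mean(scores):
    """Add all scores, then divide by the number of scores."""
    return sum(scores) / len(scores)


ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]      # Progress Check 3.5
mileage = [26.3, 28.7, 27.4, 26.6, 27.4, 26.9]           # Progress Check 3.6

ages_mean = mean(ages)            # 672 / 11
mileage_mean = mean(mileage)      # 163.3 / 6
print(round(ages_mean, 2))        # 61.09
print(round(mileage_mean, 2))     # 27.22

# Balance-point property: deviations from the mean sum to zero
# (up to floating-point rounding error).
deviations = [x - ages_mean for x in ages]
print(abs(sum(deviations)) < 1e-9)   # True
```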
WHICH AVERAGE?
If Distribution Is Not Skewed
• When a distribution of scores is not too skewed, the values of the mode, median, and
mean are similar, and any of them can be used to describe the central tendency of the
distribution.

If Distribution Is Skewed
• When extreme scores cause a distribution to be skewed, as for the infant death rates
for selected countries listed in Table 3.4, the values of the three averages can differ.

Interpreting Differences between Mean and Median


• When a distribution is skewed, report both the mean and the median.
• The differences between the values of the mean and median signal the presence of a
skewed distribution.
• If the mean exceeds the median, as it does for the infant death rates, the underlying
distribution is positively skewed because of one or more scores with relatively large
values, such as the very high infant death rates for a number of countries, especially
Sierra Leone.
• On the other hand, if the median exceeds the mean, the underlying distribution is
negatively skewed because of one or more scores with relatively small values.
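The comparison of mean and median can be sketched as a simple check. This is only a rough signal, not a formal measure of skewness; the function name `skew_signal` and the income figures are illustrative assumptions, while the retirement ages are those used in the progress checks.

```python
from statistics import mean, median


def skew_signal(scores):
    """Compare mean and median as a rough indicator of skew."""
    m, md = mean(scores), median(scores)
    if m > md:
        return "positively skewed"
    if m < md:
        return "negatively skewed"
    return "no skew indicated"


retirement_ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]   # from the notes
incomes = [30, 35, 40, 45, 50, 400]      # hypothetical, in thousands of dollars

print(skew_signal(retirement_ages))   # negatively skewed (mean 61.09 < median 63)
print(skew_signal(incomes))           # positively skewed (mean 100 > median 42.5)
```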
Problem:
Special Status of the Mean

• The mean is the single most preferred average for quantitative data.

Using the Word Average

• An average can refer to the mode, median, or mean, or even the geometric mean or
the harmonic mean.
• Conventional usage prescribes that average usually signifies mean, and this
connotation is often reinforced by the context.
• For instance, grade point average is virtually synonymous with mean grade point.

AVERAGES FOR QUALITATIVE AND RANKED DATA

Mode Always Appropriate for Qualitative Data

• When the data are qualitative, your choice among averages is restricted.
• The mode always can be used with qualitative data.

Median Sometimes Appropriate


• The median can be used whenever it is possible to order qualitative data from
least to most because the level of measurement is ordinal.
• Do not treat the various classes as though they have the same frequencies when
they actually have different frequencies.
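One way to respect unequal frequencies is to accumulate them from the lowest class upward and stop at the class where the running total first reaches half of all observations. The sketch below assumes hypothetical ordinal ratings, and `median_class` is an illustrative name; when the running total lands exactly on the halfway mark, the median strictly falls between two classes, which this sketch does not handle.

```python
def median_class(ordered_classes):
    """Return the class where the cumulative frequency first reaches half the total.

    `ordered_classes` is a list of (label, frequency) pairs, least to most.
    """
    total = sum(f for _, f in ordered_classes)
    running = 0
    for label, f in ordered_classes:
        running += f
        if running >= total / 2:
            return label


# Hypothetical ordinal ratings with unequal frequencies.
ratings = [("low", 30), ("medium", 10), ("high", 5)]
print(median_class(ratings))   # low
```

Treating the three classes as equally frequent would wrongly point to "medium", the middle label.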

Inappropriate Averages
• It would not be appropriate to report a median for unordered qualitative data with
nominal measurement, such as the ancestries of Americans.
Problem:

Averages for Ranked Data


• When the data consist of a series of ranks, with its ordinal level of measurement, the
median rank always can be obtained.
• It is simply the middlemost rank or the average of the two middlemost ranks.

V. DESCRIBING VARIABILITY:

• In Figure 4.1, each of the three frequency distributions consists of seven scores with
the same mean (10) but with different variabilities.
• Before reading on, rank the three distributions from least to most variable.
• Distribution A has the least variability, distribution B has intermediate variability,
and distribution C has the most variability.
• For distribution A with the least (zero) variability, all seven scores have the same value
(10).
• For distribution B with intermediate variability, the values of scores vary slightly (one
9 and one 11), and for distribution C with most variability, they vary even more (one
7, two 9s, two 11s, and one 13).
Importance of Variability
• Variability assumes a key role in an analysis of research results.
• Eg: A researcher might ask: Does fitness training improve, on average, the scores of
depressed patients on a mental-wellness test?
• To answer this question, depressed patients are randomly assigned to two groups,
fitness training is given to one group, and wellness scores are obtained for both
groups.
• Figure 4.2 shows the outcomes for two fictitious experiments, each with the same
mean difference of 2, but with the two groups in experiment B having less variability
than the two groups in experiment C.
• Notice that groups B and C in Figure 4.2 are the same as their counterparts in Figure
4.1.
• Although the new group B* retains exactly the same (intermediate) variability as
group B, each of its seven scores and its mean have been shifted 2 units to the right.
• Likewise, although the new group C* retains exactly the same (most) variability as group
C, each of its seven scores and its mean have been shifted 2 units to the right.
• Consequently, the crucial mean difference of 2 (from 12 − 10 = 2) is the same for both
experiments.

• Variabilities within groups assume a key role in inferential statistics.
• The relatively larger variabilities within groups in experiment C translate into less
statistical stability for the observed mean difference of 2 when it is viewed as just one
outcome among many possible outcomes for repeat experiments.
4.1 (a) small
(b) large

RANGE

• The range is the difference between the largest and smallest scores.
• In Figure 4.1, distribution A, the least variable, has the smallest range of 0 (from 10 to
10); distribution B, the moderately variable, has an intermediate range of 2 (from 11
to 9); and distribution C, the most variable, has the largest range of 6 (from 13 to 7),
in agreement with our intuitive judgments about differences in variability.
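The three ranges can be reproduced in Python. Figure 4.1 itself is not shown here, so the score lists are inferred from its description in these notes (seven scores each, a mean of 10, and the deviating scores named for B and C); treat them as an assumption.

```python
# Scores inferred from the description of Figure 4.1 (seven scores, mean of 10).
distributions = {
    "A": [10, 10, 10, 10, 10, 10, 10],
    "B": [9, 10, 10, 10, 10, 10, 11],
    "C": [7, 9, 9, 10, 11, 11, 13],
}


def score_range(scores):
    """Difference between the largest and smallest scores."""
    return max(scores) - min(scores)


for name, scores in distributions.items():
    print(name, score_range(scores))
# A 0
# B 2
# C 6
```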

Disadvantages of Range
• The range has several shortcomings.
• First, since its value depends on only two scores—the largest and the smallest—it
fails to use the information provided by the remaining scores.
• Second, the value of the range tends to increase with increases in the total number of
scores.
VARIANCE
• Variance and standard deviation are two important measures of variability in
statistics.
• Variance is a measure of how far data points vary from the mean.
• The standard deviation measures how widely scores are spread around the mean.

Reconstructing the Variance


• To qualify as a type of mean, the values of all scores must be added and then divided
by the total number of scores.
• In the case of the variance, each original score is re-expressed as a distance or
deviation from the mean by subtracting the mean.

• For each of the three distributions in Figure 4.1, the face values of the seven original
scores have been re-expressed as deviation scores from their mean of 10.
• For example, in distribution C, one score coincides with the mean of 10, four scores
(two 9s and two 11s) deviate 1 unit from the mean, and two scores (one 7 and one
13) deviate 3 units from the mean, yielding a set of seven deviation scores: one 0,
two –1s, two 1s, one –3, and one 3.

Mean of the Deviations Not a Useful Measure

• The sum of all negative deviations always counterbalances the sum of all positive
deviations, regardless of the amount of variability in the group.
• A measure of variability, known as the mean absolute deviation (or m.a.d.), can be
salvaged by summing all absolute deviations from the mean, that is, by ignoring
negative signs.
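As a quick check of both points, the sketch below uses the seven scores of distribution C. The variable names are illustrative, not taken from the original notes.

```python
dist_c = [7, 9, 9, 10, 11, 11, 13]

mean_c = sum(dist_c) / len(dist_c)            # 10.0
deviations = [x - mean_c for x in dist_c]     # -3, -1, -1, 0, 1, 1, 3

# Negative and positive deviations cancel, so their sum (and mean) is zero.
sum_of_deviations = sum(deviations)

# Mean absolute deviation: ignore the signs before averaging.
mad = sum(abs(d) for d in deviations) / len(deviations)

print(sum_of_deviations)   # 0.0
print(round(mad, 2))       # 1.43
```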

Mean of the Squared Deviations

• Before calculating the variance (a type of mean), negative signs must be eliminated
from deviation scores. Squaring each deviation generates a set of squared deviation
scores, none of which is negative.
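A minimal sketch of this step for distribution C, treating the seven scores as a complete population so that the sum of squared deviations is divided by the number of scores:

```python
dist_c = [7, 9, 9, 10, 11, 11, 13]
mean_c = sum(dist_c) / len(dist_c)                          # 10.0

# Squaring removes the negative signs: 9, 1, 1, 0, 1, 1, 9
squared_deviations = [(x - mean_c) ** 2 for x in dist_c]

sum_of_squares = sum(squared_deviations)                    # 22.0
variance = sum_of_squares / len(dist_c)                     # 22 / 7

print(sum_of_squares)       # 22.0
print(round(variance, 2))   # 3.14
```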

STANDARD DEVIATION
• The standard deviation is the square root of the mean of all squared deviations from the
mean, that is, $\text{standard deviation} = \sqrt{\text{variance}}$.
• The standard deviation is a rough measure of the average amount by which scores
deviate on either side of their mean.
Majority of Scores within One Standard Deviation
For most frequency distributions, a majority of all scores are within one standard
deviation on either side of the mean.
• In Figure 4.3, the lowercase letter s represents the standard deviation.
• As suggested in the top panel of Figure 4.3, if the distribution of IQ scores for a class
of fourth graders has a mean (X̄) of 105 and a standard deviation (s) of 15, a majority
of their IQ scores should be within one standard deviation on either side of the mean,
that is, between 90 and 120.
• For most frequency distributions, a small minority of all scores deviate more than two
standard deviations on either side of the mean.
• For instance, among the seven deviations in distribution C, none deviates more than
two standard deviations (2 × 1.77 = 3.54) on either side of the mean.
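Both generalizations can be checked against distribution C. This is only a sketch: the seven scores are treated as a population, and the names `within_one` and `beyond_two` are illustrative.

```python
import math

dist_c = [7, 9, 9, 10, 11, 11, 13]
mean_c = sum(dist_c) / len(dist_c)

variance = sum((x - mean_c) ** 2 for x in dist_c) / len(dist_c)
sd = math.sqrt(variance)   # square root of the variance

# Scores within one standard deviation on either side of the mean.
within_one = [x for x in dist_c if abs(x - mean_c) <= sd]

# Scores more than two standard deviations from the mean.
beyond_two = [x for x in dist_c if abs(x - mean_c) > 2 * sd]

print(round(sd, 2))       # 1.77
print(len(within_one))    # 5 (a majority of the 7 scores)
print(len(beyond_two))    # 0
```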

Generalizations Are for All Distributions


• These two generalizations about the majority and minority of scores are independent
of the particular shape of the distribution.

Standard Deviation: A Measure of Distance


• There’s an important difference between the standard deviation and the mean.
• The mean is a measure of position, but the standard deviation is a measure of
distance. Figure 4.4 describes the weight distribution for the males.
• The mean (X̄) of 169.51 lbs has a particular position or location along the horizontal
axis: It is located at the point, and only at the point, corresponding to
169.51 lbs.
• On the other hand, the standard deviation (s) of 23.33 lbs for the same distribution has
no particular location along the horizontal axis.

Value of Standard Deviation Cannot Be Negative


• Standard deviation distances always originate from the mean and are expressed as
positive deviations above the mean or negative deviations below the mean.
• The actual value of the standard deviation can be zero or a positive number; it can
never be a negative number, because every negative deviation becomes positive when
squared.
Problem:

4.2 (a) $80,000 to $100,000


(b) $70,000
(c) $110,000
(d) $88,000 to $92,000; $86,000; $94,000

4.3 (a) False. Relatively few students will score exactly one standard deviation from the
mean.
(b) False. Students will score both within and beyond one standard deviation from the
mean.
(c) True
(d) True
(e) False. See (b).
(f) True

STANDARD DEVIATION
Sum of Squares (SS)
• Calculating the standard deviation requires that we obtain first a value for the
variance.
• However, calculating the variance requires, in turn, that we obtain the sum of the
squared deviation scores.
• The sum of squared deviation scores, symbolized by SS, merits special attention
because it’s a major component in calculations for the variance, as well as many other
statistical measures.
Sum of Squares Formulas for Population
Standard Deviation for Population σ
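For reference, the standard textbook forms that correspond to these two headings can be written as below. The symbol N for the population size is an assumption here, since the notes do not define it; the first form of SS is the definition and the second is the equivalent computation formula.

```latex
SS = \sum (X - \mu)^2 = \sum X^2 - \frac{\left(\sum X\right)^2}{N}

\sigma^2 = \frac{SS}{N}, \qquad \sigma = \sqrt{\frac{SS}{N}}
```

For distribution C, both forms of SS give 722 − 4900/7 = 22, so σ = √(22/7) ≈ 1.77, in agreement with the value used earlier.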
If μ Is Unknown
• It would be most efficient if, as above, we could use a random sample of n deviations
expressed around the population mean, X − μ, to estimate variability in the population.
• But this is usually impossible because, in fact, the population mean is unknown.
• Therefore, we must substitute the known sample mean, X̄, for the unknown
population mean, μ, and we must use a random sample of n deviations expressed
around their own sample mean, X − X̄, to estimate variability in the population.
• Although there are n = 5 deviations in the sample, only n − 1 = 4 of these deviations are
free to vary because the sum of the n = 5 deviations from their own sample mean
always equals zero.
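The sketch below illustrates the sample version of the calculation, dividing the sum of squares by n − 1 rather than n. The five scores are hypothetical (the notes do not list the sample values), and the result is compared with Python's `statistics.stdev`, which also uses n − 1.

```python
import math
import statistics

sample = [7, 3, 1, 0, 4]          # hypothetical sample, n = 5
n = len(sample)

sample_mean = sum(sample) / n                      # 3.0
deviations = [x - sample_mean for x in sample]     # 4, 0, -2, -3, 1

# Deviations around the sample's own mean always sum to zero.
print(sum(deviations))            # 0.0

ss = sum(d ** 2 for d in deviations)               # 30.0
s = math.sqrt(ss / (n - 1))                        # divide by n - 1 = 4

print(round(s, 2))                                 # 2.74
print(math.isclose(s, statistics.stdev(sample)))   # True
```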

DEGREES OF FREEDOM (df)


• Degrees of freedom (df) refers to the number of values that are free to vary, given one
or more mathematical restrictions, in a sample being used to estimate a population
characteristic.
• The concept of degrees of freedom is introduced only because we are using scores in a
sample to estimate some unknown characteristic of the population.
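One way to see the restriction at work: once n − 1 deviations from the sample mean are known, the last one is forced. The sample below is hypothetical and the variable names are illustrative.

```python
sample = [7, 3, 1, 0, 4]                        # hypothetical sample, n = 5
sample_mean = sum(sample) / len(sample)         # 3.0
deviations = [x - sample_mean for x in sample]  # 4, 0, -2, -3, 1

# Treat the first n - 1 deviations as free to vary.
free_deviations = deviations[:-1]

# The restriction "deviations sum to zero" fixes the remaining one.
forced_last = -sum(free_deviations)

df = len(sample) - 1

print(forced_last, deviations[-1])   # 1.0 1.0
print(df)                            # 4
```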
INTERQUARTILE RANGE (IQR)
• The most important spinoff of the range, the interquartile range (IQR), is simply the
range for the middle 50 percent of the scores.
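A brief sketch of the IQR for distribution C, using `statistics.quantiles` from the Python standard library (Python 3.8 or later). Quartile conventions differ between textbooks and software, so the default "exclusive" method used here is an assumption and may not match a hand calculation exactly for other data sets.

```python
import statistics

dist_c = [7, 9, 9, 10, 11, 11, 13]

# Cut points that split the sorted scores into four equal parts.
q1, q2, q3 = statistics.quantiles(dist_c, n=4)

# Range of the middle 50 percent of the scores.
iqr = q3 - q1

print(q1, q3)   # 9.0 11.0
print(iqr)      # 2.0
```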

MEASURES OF VARIABILITY FOR QUALITATIVE AND RANKED DATA


Qualitative Data
• Measures of variability are virtually nonexistent for qualitative or nominal data.
• It is probably adequate to note merely whether scores are evenly divided among the
various classes, unevenly divided among the various classes, or concentrated mostly
in one class.
• For example, if the ethnic composition of the residents of a city is about evenly divided
among several groups, the variability with respect to ethnic groups is maximum; there
is considerable heterogeneity.
Ordered Qualitative and Ranked Data
• If qualitative data can be ordered because measurement is ordinal, then it’s
appropriate to describe variability by identifying extreme scores.
• For instance, the active membership of an officers’ club might include no one with a
rank below first lieutenant or above brigadier general.

VI. NORMAL DISTRIBUTIONS AND STANDARD (z) SCORE:

THE NORMAL CURVE


• A distribution based on 30,910 men usually is more accurate than one based on 3091,
and a distribution based on 3,091,000 usually is even more accurate.
• But it is prohibitively expensive in both time and money to even survey 30,910 people.
Fortunately, it is a fact that the distribution of heights for all American men (not just
3091 or even 3,091,000) approximates the normal curve, a well-documented
theoretical curve.

In Figure 5.2, the idealized normal curve has been superimposed on the original
distribution for 3091 men.
Interpreting the Shaded Area
• The total area under the normal curve in Figure 5.2 can be identified with all FBI
applicants.
• Viewed relative to the total area, the shaded area represents the proportion of
applicants who will be eligible because they are shorter than exactly 66 inches.

Finding a Proportion for the Shaded Area


• To find this new proportion, we cannot rely on the vertical scale in Figure 5.2, because
it describes as proportions the areas in the rectangular bars of histograms, not the
areas in the various curved sectors of the normal curve.

Properties of the Normal Curve

• A normal curve is a theoretical curve defined for a continuous variable, as described
in Section 1.6, and noted for its symmetrical bell-shaped form.
• Because the normal curve is symmetrical, its lower half is the mirror image of its
upper half.
• Being bell shaped, the normal curve peaks above a point midway along the
horizontal spread and then tapers off gradually in either direction from the peak.
• The values of the mean, median (or 50th percentile), and mode, located at a point
midway along the horizontal spread, are the same for the normal curve.

Importance of Mean and Standard Deviation


• When you’re using the normal curve, two bits of information are indispensable:
values for the mean and the standard deviation.

Different Normal Curves

• Every normal curve can be interpreted in exactly the same way once any distance from
the mean is expressed in standard deviation units.
• For example, .68, or 68 percent of the total area under a normal curve (any normal
curve) is within one standard deviation above and below the mean, and only .05, or
5 percent, of the total area is more than two standard deviations above and below the
mean.
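These two proportions can be checked with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). The second curve, with a mean of 69 and a standard deviation of 3, borrows the height values used in this unit simply to show that the proportion does not depend on the particular normal curve.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

# Area within one standard deviation on either side of the mean.
within_one = std_normal.cdf(1) - std_normal.cdf(-1)

# Area more than two standard deviations above or below the mean.
beyond_two = 2 * (1 - std_normal.cdf(2))

print(round(within_one, 2))   # 0.68
print(round(beyond_two, 2))   # 0.05

# Same proportion for a different normal curve (mean 69, SD 3).
heights = NormalDist(mu=69, sigma=3)
within_one_heights = heights.cdf(72) - heights.cdf(66)
print(round(within_one_heights, 2))   # 0.68
```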
z SCORES
• A z score is a unit-free, standardized score that, regardless of the original units of
measurement, indicates how many standard deviations a score is above or below the
mean of its distribution.
• To obtain a z score, express any original score, whether measured in inches,
milliseconds, dollars, IQ points, etc., as a deviation from its mean, and then divide
that deviation by the standard deviation:
$z = \frac{X - \mu}{\sigma}$
where X is the original score and μ and σ are the mean and the standard deviation,
respectively, for the normal distribution of the original scores.
• Since identical units of measurement appear in both the numerator and denominator
of the ratio for z, the original units of measurement cancel each other and the z score
emerges as a unit-free or standardized number, often referred to as a standard score.
A z score consists of two parts:
1. a positive or negative sign indicating whether it’s above or below the mean; and
2. a number indicating the size of its deviation from the mean in standard deviation
units.
• A z score of 2.00 always signifies that the original score is exactly two standard
deviations above its mean.
• Similarly, a z score of –1.27 signifies that the original score is exactly 1.27 standard
deviations below its mean.
• A z score of 0 signifies that the original score coincides with the mean.
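A z score is the deviation from the mean divided by the standard deviation, which can be sketched as a small function. The function name `z_score` is illustrative; the example values (mean height 69 inches, standard deviation 3 inches) are the ones used for the FBI applicants in this unit.

```python
def z_score(x, mu, sigma):
    """Return how many standard deviations x lies above (+) or below (-) mu."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (x - mu) / sigma


print(z_score(66, 69, 3))   # -1.0  (one standard deviation below the mean)
print(z_score(75, 69, 3))   # 2.0   (two standard deviations above the mean)
print(z_score(69, 69, 3))   # 0.0   (coincides with the mean)
```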

Converting to z Scores
• To answer the question about eligible FBI applicants, replace X with 66 (the maximum
permissible height), μ with 69 (the mean height), and σ with 3 (the standard deviation
of heights) and solve for z as follows:
$z = \frac{66 - 69}{3} = \frac{-3}{3} = -1$
STANDARD NORMAL CURVE

• If the original distribution approximates a normal curve, then the shift to standard or z
scores will always produce a new distribution that approximates the standard normal
curve.
• The standard normal curve always has a mean of 0 and a standard deviation of 1.
To verify that the mean of a standard normal distribution equals 0, replace
X in the z score formula with μ, the mean of any normal distribution, and then solve
for z:
$z = \frac{\mu - \mu}{\sigma} = \frac{0}{\sigma} = 0$

• Likewise, to verify that the standard deviation of the standard normal distribution
equals 1, replace X in the z score formula with μ + 1σ, the value corresponding to one
standard deviation above the mean for any (nonstandard) normal distribution, and
then solve for z:
$z = \frac{(\mu + 1\sigma) - \mu}{\sigma} = \frac{1\sigma}{\sigma} = 1$

• Although there is an infinite number of different normal curves, each with its own mean
and standard deviation, there is only one standard normal curve, with a mean of 0 and
a standard deviation of 1.
• Converting all original observations into z scores leaves the normal shape intact but
not the units of measurement.
• Shaded observations of 66 inches, 1080 hours, and 90 IQ points all reappear as a z
score of –1.00.

Standard Normal Table


• Essentially, the standard normal table consists of columns of z scores coordinated
with columns of proportions.
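In software, the lookup performed with the standard normal table corresponds to the cumulative distribution function. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later) for the FBI applicant example; the variable names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()        # mean 0, standard deviation 1

# z score for the maximum permissible height of 66 inches.
z = (66 - 69) / 3                # -1.0

# Finding a proportion: area under the curve below z.
proportion_below = std_normal.cdf(z)
print(round(proportion_below, 4))      # 0.1587

# Finding a score: go from a proportion back to a z score.
z_back = std_normal.inv_cdf(proportion_below)
print(round(z_back, 2))                # -1.0
```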
SOLVING NORMAL CURVE PROBLEMS
FINDING PROPORTIONS
Example: Finding Proportions for One Score
Example: Finding Proportions between Two Scores
Finding Proportions beyond Two Scores
FINDING SCORES
z Scores for Non-normal Distributions
• z scores are not limited to normal distributions. Non-normal distributions also can be
transformed into sets of unit-free, standardized z scores.
• In this case, the standard normal table cannot be consulted, since the shape of the
distribution of z scores is the same as that for the original non-normal distribution.
• For instance, if the original distribution is positively skewed, the distribution of z
scores also will be positively skewed.
• Regardless of the shape of the distribution, the shift to z scores always produces a
distribution of standard scores with a mean of 0 and a standard deviation of 1.
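The sketch below illustrates this with a small, hypothetical set of positively skewed scores. After conversion, the z scores have a mean of 0 and a standard deviation of 1 (up to floating-point rounding), while the skew and the ordering of the scores are unchanged.

```python
import statistics

# Hypothetical positively skewed scores (not from the original notes).
skewed = [1, 1, 2, 2, 3, 4, 8, 19]

mu = statistics.fmean(skewed)        # 5.0
sigma = statistics.pstdev(skewed)    # population standard deviation

z_scores = [(x - mu) / sigma for x in skewed]

z_mean = statistics.fmean(z_scores)
z_sd = statistics.pstdev(z_scores)

print(abs(z_mean) < 1e-9)            # True: mean is 0
print(abs(z_sd - 1) < 1e-9)          # True: standard deviation is 1

# The long right tail survives: the largest z is much farther from 0
# than the smallest z.
print(max(z_scores) > abs(min(z_scores)))   # True
```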

Interpreting Test Scores


• The use of z scores can help you identify a person’s relative strengths and
weaknesses on several different tests.

Importance of Reference Group


• Remember that z scores reflect performance relative to some group rather than an
absolute standard.
• A meaningful interpretation of z scores requires, therefore, that the nature of the
reference group be specified.
Standard Score
• Whenever any unit-free scores are expressed relative to a known mean and a known
standard deviation, they are referred to as standard scores.
• Although z scores qualify as standard scores because they are unit-free and expressed
relative to a known mean of 0 and a known standard deviation of 1, other scores also
qualify as standard scores.

Transformed Standard Scores


• Being by far the most important standard score, z scores are often viewed as
synonymous with standard scores.
• For convenience, particularly when reporting test results to a wide audience, z scores
can be changed to transformed standard scores, other types of unit-free standard
scores that lack negative signs and decimal points.
• These transformations change neither the shape of the original distribution nor the
relative standing of any test score within the distribution.
Figure 5.11 shows the values of some of the more common types of transformed standard
scores relative to the various portions of the area under the normal curve.
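A transformed standard score is usually obtained by multiplying the z score by the desired standard deviation and adding the desired mean. The sketch below assumes the common T score convention (mean 50, standard deviation 10), which is not spelled out in these notes; the function name and the sample z scores are illustrative.

```python
def transform(z, new_mean, new_sd):
    """Convert a z score to a standard score with the given mean and SD."""
    return new_mean + z * new_sd


z_scores = [-1.0, 0.0, 1.5]

# Assumed convention: T scores have a mean of 50 and an SD of 10.
t_scores = [transform(z, 50, 10) for z in z_scores]

print(t_scores)   # [40.0, 50.0, 65.0]
```

Because the transformation is linear with a positive multiplier, the order of the scores and the shape of the distribution stay the same.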
