AD3491-Unit 2

Uploaded by

jeyasundari

AD3491 FUNDAMENTALS OF DATA SCIENCE AND ANALYTICS
L T P C: 3 0 0 3
COURSE OBJECTIVES:
CO1: Explain the data analytics pipeline
CO2: Apply descriptive analytics techniques to visualize data
CO3: Perform statistical inferences from data
CO4: Apply sampling test techniques to get the variance in the data
CO5: Build models for predictive analytics
TEXT BOOKS:
1. David Cielen, Arno D. B. Meysman, and Mohamed Ali, “Introducing Data Science”, Manning Publications, 2016. (first two
chapters for Unit I).
2. Robert S. Witte and John S. Witte, “Statistics”, Eleventh Edition, Wiley Publications, 2017.
3. Jake VanderPlas, “Python Data Science Handbook”, O’Reilly, 2016.

Department of Computer Science and Business systems


UNIT II DESCRIPTIVE ANALYTICS 10
Topic Text Book 2
Unit – II DESCRIPTIVE ANALYTICS
Frequency distributions – Outliers – Interpreting distributions –
Graphs – Averages – Describing variability – Interquartile range –
Variability for qualitative and ranked data –
Normal distributions –
z scores – Correlation – Scatter plots –
Regression – Regression line – Least squares regression line –
Standard error of estimate –
Interpretation of r² –
Multiple regression equations –
Regression toward the mean.



Descriptive Statistics:
Statistics exists because of the prevalence of variability in the real
world. In its simplest form, known as descriptive statistics, statistics
provides us with tools—tables, graphs, averages, ranges, correlations
—for organizing and summarizing the inevitable variability in
collections of actual observations or scores.



Inferential Statistics:
Statistics also provides tools—a variety of tests and estimates—for
generalizing beyond collections of actual observations. This more
advanced area is known as inferential statistics. Tools from
inferential statistics permit us to use a relatively small collection of
actual observations to evaluate claims about a larger population.



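Progress Check *1.1: Indicate whether each of the following statements
typifies descriptive statistics or inferential statistics.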
(a) Students in my statistics class are, on average, 23 years old.
(b) The population of the world exceeds 7 billion (that is,
7,000,000,000 or 1 million multiplied by 7000).
(c) Either four or eight years have been the most frequent terms of
office actually served by U.S. presidents.
(d) Sixty-four percent of all college students favor right-to-abortion
laws.
Answers to Progress Check *1.1:
(a) descriptive statistics
(b) inferential statistics
(c) descriptive statistics
(d) inferential statistics



Population and sample:
In statistics, a population refers to any complete collection of
observations or potential observations, whereas a sample refers to
any smaller collection of actual observations drawn from a
population.



Frequency distributions:
A frequency distribution is a collection of observations produced by
sorting observations into classes and showing their frequency (f ) of
occurrence in each class.



Frequency distributions: Quantitative Data



Frequency distributions: Qualitative Data



Levels of Measurement:
[Figures for the levels of measurement, Progress Check 1.4, and its answers appeared as slide images and are not reproduced here.]


■ Quantitative variables can be further distinguished in terms of
whether they are discrete or continuous.
Discrete Variable: A discrete variable consists of isolated numbers
separated by gaps.
■ Ex: Number of Students in a Class, Number of Cars in a Parking Lot, Roll
of a Die

Continuous Variable: A continuous variable consists of numbers
whose values, at least in theory, have no restrictions.
■ Ex: Height of a person, Temperature, Time
Discrete and Continuous Variables:
[Progress check and answers appeared as slide images and are not reproduced here.]


Independent and Dependent Variables:
Independent Variable:
An independent variable is a variable that is manipulated or
controlled by the experimenter. It is the variable that is thought to
have a causal effect on the dependent variable. In other words,
changes in the independent variable are hypothesized to cause
changes in the dependent variable.



Independent and Dependent Variables:
Dependent Variable:
A dependent variable is a variable that is observed and measured. It
is the outcome or response variable that is expected to be influenced
by the independent variable(s). The dependent variable is what
researchers are interested in understanding or predicting.



Independent and Dependent Variables:
Confounding Variable:
A confounding variable is a variable that is related to both the
independent and dependent variables. It can distort or confuse the
relationship between the independent and dependent variables,
leading to erroneous conclusions if not properly accounted for.
Confounding variables can create spurious correlations or mask true
relationships.
Independent and Dependent Variables:
Example:
Let's consider an example where we want to investigate the effect of
studying hours (independent variable) on exam scores (dependent
variable) while controlling for the confounding variable of prior
knowledge.



Independent and Dependent Variables:
Independent Variable: Studying Hours (e.g., number of hours spent
studying for an exam)
Dependent Variable: Exam Scores (e.g., the score achieved on the
exam)
Confounding Variable: Prior Knowledge (e.g., the level of
knowledge students have before studying for an exam)



Experiment
                        Treatment Group         Control Group
INDEPENDENT VARIABLE    More Prior Knowledge    Not More Prior Knowledge
DEPENDENT VARIABLE      Result Fail             Result Fail
Is difference real or transitory?
Observational Study
Treatment Group
FIRST VARIABLE: Prior Knowledge Score
SECOND VARIABLE: No. of hours studied
Are these two variables related?



Frequency Distributions: A collection of observations produced by
sorting observations into classes and showing their frequency (f) of
occurrence in each class.
Frequency Distribution for Ungrouped Data: A frequency distribution
produced whenever observations are sorted into classes of single values.
Frequency Distribution for Grouped Data: A frequency distribution
produced whenever observations are sorted into classes of more than one
value.



FREQUENCY DISTRIBUTIONS FOR QUANTITATIVE DATA

■ A frequency distribution is a collection of observations produced by
sorting observations into classes and showing their frequency (f) of
occurrence in each class. When observations are sorted into classes of
single values, as in Table 2.1, the result is referred to as a frequency
distribution for ungrouped data.
■ The frequency distribution shown in Table 2.1 is only partially
displayed because there are more than 100 possible values between the
largest and smallest observations. Frequency distributions for
ungrouped data are much more informative when the number of
possible values is less than about 20. Under these circumstances, they
are a straightforward method for organizing data. Otherwise, if there
are 20 or more possible values, consider using a frequency distribution
for grouped data.



FREQUENCY DISTRIBUTIONS FOR QUANTITATIVE DATA

■ When observations are sorted into classes of more than one value, as in
Table 2.2, the result is referred to as a frequency distribution for
grouped data. Let’s look at the general structure of this frequency
distribution. Data are grouped into class intervals with 10 possible
values each. The bottom class includes the smallest observation (133),
and the top class includes the largest observation (245). The distance
between bottom and top is occupied by an orderly series of classes.
The frequency (f) column shows the frequency of observations in each
class and, at the bottom, the total number of observations in all classes.
Let’s summarize the more important properties of the distribution of
weights in Table 2.2. Although ranging from the 130s to the 240s, the
weights peak in the 150s, with a progressively decreasing but
relatively heavy concentration in the 160s and 170s. Furthermore, the
distribution of weights is not balanced about its peak, but tilted in the
direction of the heavier weights.
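
The difference between ungrouped and grouped frequency distributions can be sketched in a few lines of Python. The weights below are made up for illustration (they are not the data behind Tables 2.1 and 2.2), the function names are hypothetical, and the code assumes whole-number scores.

```python
from collections import Counter

# Hypothetical weights in pounds, invented for illustration only
weights = [133, 152, 155, 158, 160, 165, 172, 180, 205, 245]

def ungrouped_frequencies(scores):
    """Frequency of each single value, ordered from smallest to largest."""
    return dict(sorted(Counter(scores).items()))

def grouped_frequencies(scores, width=10):
    """Frequency per class interval of the given width.

    Lower boundaries are multiples of the width, and every class between
    the smallest and largest score is listed, even when its frequency is 0.
    """
    lowest = (min(scores) // width) * width
    highest = (max(scores) // width) * width
    table = {}
    for lower in range(lowest, highest + width, width):
        label = f"{lower}-{lower + width - 1}"
        table[label] = sum(lower <= x < lower + width for x in scores)
    return table

grouped = grouped_frequencies(weights)
for label, f in grouped.items():
    print(label, f)
print("Total", sum(grouped.values()))
```

With only ten scores the ungrouped table is nearly one row per observation, which is why grouping becomes more informative as the number of possible values grows.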



GUIDELINES FOR FREQUENCY DISTRIBUTIONS:
Essential
1. Each observation should be included in one, and only one, class.
Example: 130–139, 140–149, 150–159, etc. It would be incorrect to use 130–
140, 140–150, 150–160, etc., in which, because the boundaries of classes
overlap, an observation of 140 (or 150) could be assigned to either of two
classes.
2. List all classes, even those with zero frequencies.
Example: Listed in Table 2.2 is the class 210–219 and its frequency of zero. It
would be incorrect to skip this class because of its zero frequency.
3. All classes should have equal intervals.
Example: 130–139, 140–149, 150–159, etc. It would be incorrect to use
130–139, 140–159, etc., in which the second class interval (140–159) is
twice as wide as the first class interval (130–139).



GUIDELINES FOR FREQUENCY DISTRIBUTIONS:
Optional
4. All classes should have both an upper boundary and a lower boundary.
5. Select the class interval from convenient numbers, such as 1, 2, 3, . . . 10,
particularly 5 and 10 or multiples of 5 and 10.
6. The lower boundary of each class interval should be a multiple of the class
interval.
7. Aim for a total of approximately 10 classes.



Real limits:
The real limits are located at the midpoint of the gap between adjacent tabled
boundaries; that is, one-half of one unit of measurement below the lower tabled
boundary and one-half of one unit of measurement above the upper tabled
boundary.
For example, the real limits for 140–149 are 139.5 and 149.5, and the actual width of the class
interval would be 10 (from 149.5 − 139.5 = 10).
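
As a small sketch of the real-limits rule, the hypothetical helper below takes the tabled boundaries and the unit of measurement (assumed here to be 1) and returns the real limits.

```python
def real_limits(lower, upper, unit=1):
    """Real limits lie half a unit below the lower tabled boundary
    and half a unit above the upper tabled boundary."""
    half = unit / 2
    return lower - half, upper + half

low, high = real_limits(140, 149)
width = high - low
print(low, high, width)
```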

OUTLIERS:
An outlier is a very extreme score.
Check for Accuracy: Whenever you encounter an outrageously extreme value,
such as a GPA of 0.06, attempt to verify its accuracy. If the outlier survives an
accuracy check, it should be treated as a legitimate score.



Might Exclude from Summaries: You might choose to segregate (but not to
suppress!) an outlier from any summary of the data. For example, you might
relegate it to a footnote instead of using excessively wide class intervals in
order to include it in a frequency distribution. Or you might use various
numerical summaries, such as the median and interquartile range, that ignore
extreme scores, including outliers.



Might Enhance Understanding:
Insofar as a valid outlier can be viewed as the product of special
circumstances, it might help you to understand the data. For example, you
might understand better why crime rates differ among communities by studying
the special circumstances that produce a community with an extremely low (or
high) crime rate, or why learning rates differ among third graders by studying a
third grader who learns very rapidly (or very slowly).

RELATIVE FREQUENCY DISTRIBUTIONS:
Relative frequency distributions show the frequency of each class as a part or
fraction of the total frequency for the entire distribution.
Constructing Relative Frequency Distributions:
To convert a frequency distribution into a relative frequency distribution,
divide the frequency for each class by the total frequency for the entire
distribution.



RELATIVE FREQUENCY DISTRIBUTIONS:
To convert the relative frequencies in
Table 2.5 from proportions to
percentages, multiply each proportion
by 100; that is, move the decimal
point two places to the right.
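
The conversion to proportions and percentages might be sketched as follows. The frequencies are hypothetical (they are not the values in Table 2.5), and the function name is an assumption made for this example.

```python
def relative_frequencies(freqs):
    """Divide each class frequency by the total frequency."""
    total = sum(freqs.values())
    return {label: f / total for label, f in freqs.items()}

# Hypothetical frequency distribution, invented for illustration
freqs = {"130-139": 1, "140-149": 3, "150-159": 4, "160-169": 2}

proportions = relative_frequencies(freqs)
# Multiply each proportion by 100 to express it as a percentage
percentages = {label: round(p * 100, 1) for label, p in proportions.items()}

for label in freqs:
    print(label, freqs[label], proportions[label], percentages[label])
```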

CUMULATIVE FREQUENCY DISTRIBUTIONS:
Cumulative frequency distributions show the total number of observations in
each class and in all lower-ranked classes.
Constructing Cumulative Frequency Distributions:
To convert a frequency distribution into a cumulative frequency distribution,
add to the frequency of each class the sum of the frequencies of all classes
ranked below it. This gives the cumulative frequency for that class. Begin with
the lowest-ranked class in the frequency distribution and work upward, finding
the cumulative frequencies in ascending order.
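
A minimal sketch of this running total, using hypothetical class frequencies listed from the lowest-ranked class upward:

```python
from itertools import accumulate

# Hypothetical classes, listed from lowest-ranked to highest-ranked
classes = ["130-139", "140-149", "150-159", "160-169"]
freqs = [1, 3, 4, 2]

# Each entry is the class frequency plus the frequencies of all lower classes
cumulative = list(accumulate(freqs))
cumulative_percent = [100 * c / sum(freqs) for c in cumulative]

for label, f, cf, cp in zip(classes, freqs, cumulative, cumulative_percent):
    print(label, f, cf, cp)
```

The last cumulative frequency always equals the total number of observations, which is a quick check on the arithmetic.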

Percentile Ranks:
When used to describe the relative position of any score within its parent
distribution, cumulative percentages are referred to as percentile ranks. The
percentile rank of a score indicates the percentage of scores in the entire
distribution with similar or smaller values than that score. Thus a weight has a
percentile rank of 80 if equal or lighter weights constitute 80 percent of the
entire distribution.
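
Under the definition above (percentage of scores with equal or smaller values), a percentile rank can be computed directly from raw scores. The weights are hypothetical and the helper name is an assumption for this sketch.

```python
def percentile_rank(scores, score):
    """Percentage of scores that are equal to or smaller than `score`."""
    at_or_below = sum(x <= score for x in scores)
    return 100 * at_or_below / len(scores)

# Hypothetical weights in pounds, invented for illustration
weights = [133, 152, 155, 158, 160, 165, 172, 180, 205, 245]

rank_180 = percentile_rank(weights, 180)
print(rank_180)
```

Here 8 of the 10 made-up weights are 180 or lighter, so 180 has a percentile rank of 80.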



Frequency distributions for qualitative (nominal) data:
When, among a set of observations, any single observation is a word, letter, or
numerical code, the data are qualitative. Frequency distributions for qualitative
data are easy to construct. Simply determine the frequency with which
observations occupy each class, and report these frequencies as shown in Table
2.7 for the Facebook profile survey. This frequency distribution reveals that Yes
replies are approximately twice as prevalent as No replies.



Ordered Qualitative Data:
It’s totally arbitrary whether Yes is listed above or below No in Table 2.7.
When, however, qualitative data have an ordinal level of measurement because
observations can be ordered from least to most, that order should be preserved
in the frequency table, as illustrated in Table 2.8, in which military ranks are
listed in descending order from general to lieutenant.



Relative and Cumulative Distributions for Qualitative Data:
Frequency distributions for qualitative variables can always be converted into
relative frequency distributions, as illustrated in Table 2.8. Furthermore, if
measurement is ordinal because observations can be ordered from least to most,
cumulative frequencies (and cumulative percentages) can be used. As
illustrated in Table 2.8, it’s appropriate to claim, for example, that a
captain has an approximate percentile rank of 63
among officers since 62.5 (or 63) is the cumulative percent for this class. If
measurement is only nominal because observations cannot be ordered, as in
Table 2.7, a cumulative frequency distribution is meaningless.

GRAPHS FOR QUANTITATIVE DATA:
Histogram
A bar-type graph for quantitative data. The common boundaries between
adjacent bars emphasize the continuity of the data, as with continuous
variables.



GRAPHS FOR QUANTITATIVE DATA:
Histogram
■ Equal units along the horizontal axis (the X axis, or abscissa) reflect the various
class intervals of the frequency distribution.
■ Equal units along the vertical axis (the Y axis, or ordinate) reflect increases in
frequency. (The units along the vertical axis do not have to be the same width
as those along the horizontal axis.)
■ The intersection of the two axes defines the origin at which both numerical
scales equal 0.



■ Numerical scales always increase from left to right along the horizontal axis and
from bottom to top along the vertical axis. It is considered good practice to use
wiggly lines to highlight breaks in scale, such as those along the horizontal axis in
Figure 2.1, between the origin of 0 and the smallest class of 130–139.
■ The body of the histogram consists of a series of bars whose heights reflect the
frequencies for the various classes. Notice that adjacent bars in histograms have
common boundaries that emphasize the continuity of quantitative data for
continuous variables. The introduction of gaps between adjacent bars would
suggest an artificial disruption in the data more appropriate for discrete
quantitative variables or for qualitative variables.



Frequency Polygon:
An important variation on a histogram is the frequency polygon, or line graph.
Frequency polygons may be constructed directly from frequency distributions.
A. This panel shows the histogram for the weight distribution.
B. Place dots at the midpoints of each bar top or, in the absence of bar tops, at
midpoints for classes on the horizontal axis, and connect them with straight
lines.
[To find the midpoint of any class, such as 160–169, simply add the two tabled
boundaries (160 + 169 = 329) and divide this sum by 2 (329/2 = 164.5).]

C. Anchor the frequency polygon to the horizontal axis. First, extend the upper
tail to the midpoint of the first unoccupied class (250–259) on the upper flank
of the histogram. Then extend the lower tail to the midpoint of the first
unoccupied class (120–129) on the lower flank of the histogram. Now all of the
area under the frequency polygon is enclosed completely.
D. Finally, erase all of the histogram bars, leaving only the frequency polygon.
Frequency polygons are particularly useful when two or more frequency
distributions or relative frequency distributions are to be included in the same
graph.
Stem and Leaf Display:
A device for sorting quantitative data on the basis of leading and trailing digits.
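
One way to sketch a stem and leaf display is to treat the last digit of each score as the leaf and the remaining leading digits as the stem. This split, the function name, and the weights are assumptions made for illustration; the code expects non-negative whole numbers.

```python
from collections import defaultdict

def stem_and_leaf(scores):
    """Map each stem (leading digits) to its sorted leaves (last digit)."""
    display = defaultdict(list)
    for x in sorted(scores):
        stem, leaf = divmod(x, 10)
        display[stem].append(leaf)
    return dict(display)

# Hypothetical weights in pounds, invented for illustration
weights = [133, 152, 155, 158, 160, 165, 172, 180, 205, 245]

display = stem_and_leaf(weights)
for stem, leaves in display.items():
    print(f"{stem:>3} | {''.join(str(leaf) for leaf in leaves)}")
```

This sketch omits stems that have no leaves; a hand-drawn display would normally list them as empty rows.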

Typical Shapes:
Whether expressed as a histogram, a frequency polygon, or a stem and leaf display, an important
characteristic of a frequency distribution is its shape. Figure 2.3 shows some of the more typical
shapes for smoothed frequency polygons (which ignore the inevitable irregularities of real data).
A GRAPH FOR QUALITATIVE (NOMINAL) DATA:
Bar Graph: A bar-type graph for qualitative data. Gaps between adjacent bars emphasize
the discontinuous nature of the data.
As with histograms, equal segments along the horizontal axis are allocated to
the different words or classes that appear in the frequency distribution for
qualitative data. Likewise, equal segments along the vertical axis reflect
increases in frequency. The body of the bar graph consists of a series of bars
whose heights reflect the frequencies for the various words or classes.

MISLEADING GRAPHS:
Graphs can be constructed in an unscrupulous manner to support a particular point of view.
■ The width of the Yes bar is more than three times that of the No bar, thus violating
the custom that bars be equal in width.
■ The lower end of the frequency scale is omitted, thus violating the custom that the
entire scale be reproduced, beginning with zero.



MISLEADING GRAPHS:
■ The height of the vertical axis is several times the width of the horizontal
axis, thus violating the custom, heretofore unmentioned, that the vertical axis
be approximately as tall as the horizontal axis is wide. Beware of graphs in
which, because the vertical axis is many times larger than the horizontal axis
(as in Figure 2.5), frequency differences are exaggerated, or in which, because
the vertical axis is many times smaller than the horizontal axis, frequency
differences are suppressed.

Describing Data with Averages:
MODE:
The mode reflects the value of the most frequently occurring score.
Bimodal:
Describes any distribution with two obvious peaks.
Multimodal:
Distributions with more than two peaks are referred to as multimodal.
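
As a rough illustration, Python's standard `statistics.multimode` returns every value tied for the highest frequency. The scores are hypothetical. Note the assumption: this only detects exact ties, whereas the slide describes a bimodal distribution more loosely as one with two obvious peaks.

```python
from statistics import multimode

# Hypothetical scores, invented for illustration
one_peak = [2, 3, 3, 3, 4, 5]
two_peaks = [1, 2, 2, 2, 3, 6, 7, 7, 7, 8]

modes_one = multimode(one_peak)    # a single most frequent value
modes_two = multimode(two_peaks)   # two values tied for most frequent

print(modes_one, modes_two)
```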

MEDIAN:
The median reflects the middle value when observations are ordered from least
to most.
The median splits a set of ordered observations into two equal parts, the upper
and lower halves. In other words, the median has a percentile rank of 50, since
observations with equal or smaller values constitute 50 percent of the entire
distribution.
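
A short sketch of finding the median with Python's standard library, using hypothetical scores. With an odd number of scores the median is the middle score; with an even number, `statistics.median` averages the two middle scores.

```python
from statistics import median

# Hypothetical scores, invented for illustration (order does not matter;
# median() sorts the observations internally)
odd_count = [9, 1, 5, 3, 7]
even_count = [7, 1, 5, 3]

median_odd = median(odd_count)     # middle value of 1, 3, 5, 7, 9
median_even = median(even_count)   # average of the two middle values 3 and 5

print(median_odd, median_even)
```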

MEAN:
The mean is the most common average, one you have doubtless calculated
many times.
The mean is found by adding all scores and then dividing by the number of
scores.
That is,
$\text{Mean} = \dfrac{\text{sum of all scores}}{\text{number of scores}}$



Sample Mean ($\bar{X}$):
The balance point for a sample, found by dividing the sum for the values of all
scores in the sample by the number of scores in the sample.
$\bar{X} = \dfrac{\sum X}{n}$
Sample Size (n):
The total number of scores in the sample.



Population Mean (μ) :
The balance point for a population, found by dividing the sum for all scores in
the population by the number of scores in the population.
$\mu = \dfrac{\sum X}{N}$
Population Size (N) :
The total number of scores in the population.
Mean as Balance Point:
The mean serves as the balance point for its frequency distribution.
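
The balance-point idea can be checked numerically: deviations below the mean exactly offset deviations above it, so they sum to zero. The scores are hypothetical.

```python
from statistics import mean

# Hypothetical scores, invented for illustration
scores = [2, 4, 6, 8, 10]

m = mean(scores)                        # sum of scores / number of scores
deviations = [x - m for x in scores]    # distance of each score from the mean
total_deviation = sum(deviations)       # negative and positive parts cancel

print(m, deviations, total_deviation)
```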

Measures of Variability :
Measures of variability reflect the amount by which observations are dispersed or
scattered in a distribution. These measures assume a key role in the analysis of research
results.
The simplest measure of variability, the range, is readily calculated and understood, but it
has two shortcomings.
Among measures of variability, the variance and particularly the standard deviation
occupy the same exalted position as does the mean among measures of central tendency.
Ex: Range, Variance, Standard Deviation, Degrees of Freedom (DF), Interquartile
Range(IQR)

Range:
The range is the difference between the largest and smallest scores.
Shortcomings of Range:
The range has two shortcomings. First, since its value depends on only two
scores—the largest and the smallest—it fails to use the information provided by
the remaining scores. Furthermore, the value of the range tends to increase with
increases in the total number of scores.
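
Both shortcomings can be seen in a small sketch with hypothetical weights: the range uses only the two extreme scores, and adding scores can only keep it the same or make it larger.

```python
def score_range(scores):
    """Difference between the largest and smallest scores."""
    return max(scores) - min(scores)

# Hypothetical weights in pounds, invented for illustration
small_group = [152, 158, 160]
larger_group = small_group + [133, 245]   # same scores plus two more

range_small = score_range(small_group)
range_larger = score_range(larger_group)

print(range_small, range_larger)
```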



Variance:
The variance is the mean of all squared deviation scores, and it serves as a valid measure of variability.
*Taking the mean of the absolute (rather than squared) deviation scores gives a different
measure of variability, known as the mean absolute deviation.
Standard Deviation:
Simply take the square root of the variance. This produces a new measure,
known as the standard deviation, that describes variability in the original units
of measurement.
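
A minimal sketch of these two definitions, treating a set of hypothetical scores as a complete population (so the squared deviations are averaged over all N scores). The standard-library functions `pvariance` and `pstdev` are shown only as a cross-check.

```python
from math import sqrt
from statistics import pstdev, pvariance

# Hypothetical scores, treated here as a whole population
scores = [2, 4, 6, 8, 10]

mean_score = sum(scores) / len(scores)
squared_deviations = [(x - mean_score) ** 2 for x in scores]

variance = sum(squared_deviations) / len(scores)   # mean of squared deviations
std_dev = sqrt(variance)                           # back in the original units

print(variance, std_dev)
print(pvariance(scores), pstdev(scores))           # library cross-check
```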



Actually Exceeds Mean Deviation:
Strictly speaking, the standard deviation usually exceeds the mean deviation or,
more accurately, the mean absolute deviation.
Majority of Scores within One Standard Deviation:
For most frequency distributions, a majority (often as many as 68 percent) of
all scores are within one standard deviation on either side of the mean.
A Small Minority of Scores Deviate More than Two Standard Deviations:
For most frequency distributions, a small minority (often as small as 5 percent)
of all scores deviate more than two standard deviations on either side of the
mean.
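
These rules of thumb can be checked on any set of scores. The sketch below uses a small hypothetical data set, so the proportions it reports are specific to these made-up scores and will not match the 68 and 5 percent figures exactly.

```python
from statistics import mean, pstdev

# Hypothetical scores, invented for illustration
scores = [2, 4, 6, 8, 10]

m = mean(scores)
sd = pstdev(scores)                                   # population standard deviation
mad = mean([abs(x - m) for x in scores])              # mean absolute deviation

# Proportion of scores within one standard deviation of the mean
within_one = sum(abs(x - m) <= sd for x in scores) / len(scores)
# Proportion of scores more than two standard deviations from the mean
beyond_two = sum(abs(x - m) > 2 * sd for x in scores) / len(scores)

print(sd, mad, within_one, beyond_two)
```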
The mean is a measure of position, but the standard deviation is a measure of distance (on
either side of the mean of the distribution). The value of the standard deviation can never
be negative.
Sum of Squares (SS):
■ The sum of squared deviation scores, or more simply the sum of squares, is
symbolized by SS. For a population with mean μ, $SS = \sum (X - \mu)^2$.

Standard Deviation for Population (σ):
A rough measure of the average amount by which scores in the population deviate on either
side of their population mean.

DEGREES OF FREEDOM (df):
Degrees of freedom (df) refers to the number of values that are free to vary,
given one or more mathematical restrictions, in a sample being used to estimate
a population characteristic.
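
As a sketch of why a sample of n scores leaves only n − 1 deviations free to vary: deviations from the sample mean must sum to zero, so the last one is fixed by the others. The sample is hypothetical, and dividing the sum of squares by n − 1 is the usual convention for the sample variance (the formula itself is not reproduced on these slides).

```python
from statistics import pvariance, variance

# Hypothetical sample, invented for illustration
sample = [2, 4, 6, 8, 10]
n = len(sample)

sample_mean = sum(sample) / n
deviations = [x - sample_mean for x in sample]

# Restriction: deviations sum to zero, so the last is determined by the first n - 1
last_from_others = -sum(deviations[:-1])

ss = sum(d ** 2 for d in deviations)       # sum of squares
sample_variance = ss / (n - 1)             # divide by df = n - 1

print(last_from_others, deviations[-1])
print(sample_variance, variance(sample), pvariance(sample))
```

The library's `variance` uses n − 1 in the denominator and `pvariance` uses n, which is why the two printed values differ for the same scores.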

Interquartile Range (IQR) :
The range for the middle 50 percent of the scores.
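
A rough sketch using the standard library's `statistics.quantiles`. The scores are hypothetical, and quartile conventions differ between textbooks and software, so the value below reflects Python's default ("exclusive") method rather than any particular hand-calculation procedure.

```python
from statistics import quantiles

# Hypothetical scores, invented for illustration
scores = [1, 2, 3, 4, 5, 6, 7]

# n=4 splits the ordered scores into quarters: returns Q1, Q2 (median), Q3
q1, q2, q3 = quantiles(scores, n=4)
iqr = q3 - q1   # range spanned by the middle 50 percent of scores

print(q1, q2, q3, iqr)
```

Because the IQR ignores the upper and lower quarters of the scores, replacing the 7 with an extreme value such as 700 leaves it unchanged here.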

Measures Of Variability For Qualitative And Ranked Data:
Qualitative Data:
Measures of variability are virtually nonexistent for qualitative or nominal data.
It is probably adequate to note merely whether scores are evenly divided
among the various classes (maximum variability), unevenly divided among the
various classes (intermediate variability), or concentrated mostly in one class
(minimum variability).



Measures Of Variability For Qualitative And Ranked Data:
Ordered Qualitative and Ranked Data:
If qualitative data can be ordered because measurement is ordinal (or if the data
are ranked), then it’s appropriate to describe variability by identifying extreme
scores (or ranks).



THE NORMAL CURVE:
A theoretical curve noted for its symmetrical bell-shaped form.



Properties of the Normal Curve:
■ The normal curve is a theoretical curve defined for a continuous variable, and noted
for its symmetrical bell-shaped form.
■ Because the normal curve is symmetrical, its lower half is the mirror image of its
upper half.
■ The normal curve peaks above a point midway along the horizontal spread and then
tapers off gradually in either direction from the peak (without actually touching
the horizontal axis, since, in theory, the tails of a normal curve extend infinitely
far).
■ The values of the mean, median (or 50th percentile), and mode, located at a point
midway along the horizontal spread, are the same for the normal curve.
Importance of Mean and Standard Deviation:
■ When you’re using the normal curve, two bits of information are indispensable:
values for the mean and the standard deviation.
■ For example, before the normal curve can be used to answer the question about
eligible FBI applicants, it must be established that, for the original distribution of
3091 men, the mean height equals 69 inches and the standard deviation equals 3
inches.

Different Normal Curves:
Since the normal curve is an idealized curve that is presumed to describe a complete set of
observations or a population, the symbols μ and σ are used to represent the mean and standard
deviation of the population, respectively. Different values of μ and σ produce different
normal curves.
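
As a rough illustration of why μ and σ are enough to pin down any normal curve, the
sketch below uses Python's standard-library statistics.NormalDist. The first curve uses
the height distribution from these notes (mean 69, standard deviation 3); the second
curve's mean of 100 and standard deviation of 15 are made-up values for illustration only.

```python
from statistics import NormalDist

def area_within_one_sd(mu, sigma):
    """Proportion of a normal curve lying within one standard deviation of its mean."""
    curve = NormalDist(mu, sigma)
    return curve.cdf(mu + sigma) - curve.cdf(mu - sigma)

heights = area_within_one_sd(69, 3)    # height distribution from the notes
other = area_within_one_sd(100, 15)    # hypothetical second normal curve

# Both curves give the same proportion, roughly 0.68.
print(round(heights, 4), round(other, 4))
```

Although the two curves differ in location and spread, the proportion within one
standard deviation of the mean is the same for both.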

z SCORES:
A z score is a unit-free, standardized score that, regardless of the original units of
measurement, indicates how many standard deviations a score is above or below the mean of
its distribution.

A z score consists of two parts:


1. a positive or negative sign indicating whether it’s above or below the mean; and
2. a number indicating the size of its deviation from the mean in standard deviation units.
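
The two parts listed above follow from the standard conversion z = (X − μ) / σ. The
short sketch below applies it to the height distribution from these notes (mean 69
inches, standard deviation 3 inches); the function name and the two heights, 66 and 75
inches, are hypothetical and chosen only for illustration.

```python
def z_score(x, mean, sd):
    """Return how many standard deviations x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# Height distribution from the notes: mean 69 inches, SD 3 inches.
# The heights 66 and 75 are hypothetical inputs.
z_short = z_score(66, 69, 3)   # negative sign: below the mean
z_tall = z_score(75, 69, 3)    # positive sign: above the mean

print(z_short, z_tall)
```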

z scores can be negative, but areas under the normal curve cannot.
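
A minimal sketch of this point, assuming Python's standard-library
statistics.NormalDist as a stand-in for a printed normal curve table; the z score of
−1.0 is a hypothetical input.

```python
from statistics import NormalDist

standard_normal = NormalDist(0, 1)   # mean 0, standard deviation 1

# A negative z score still yields a positive area (a proportion between 0 and 1).
area_below = standard_normal.cdf(-1.0)   # area to the left of z = -1
area_above = 1 - area_below              # area to the right of z = -1

print(round(area_below, 4), round(area_above, 4))
```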

SCATTERPLOTS:
A graph containing a cluster of dots that represents all pairs of scores.

Positive Relationship:
Occurs insofar as pairs of scores tend to occupy similar relative positions (high
with high and low with low) in their respective distributions.
Negative Relationship:
Occurs insofar as pairs of scores tend to occupy dissimilar relative positions
(high with low and vice versa) in their respective distributions.
Little or No Relationship:
No regularity is apparent among the pairs of scores in panel C.

Perfect Relationship:
A dot cluster that equals (rather than merely approximates) a straight line
reflects a perfect relationship between two variables. In practice, perfect
relationships are most unlikely.
Linear Relationship:
A relationship that can be described best with a straight line.
Curvilinear Relationship:
A relationship that can be described best with a curved line.

Correlation coefficient:
A correlation coefficient is a number between –1 and 1 that describes the
relationship between pairs of variables.
Named in honor of the British scientist Karl Pearson, the Pearson correlation
coefficient, r, has two key properties:
1. The sign of r indicates the type of linear relationship, whether positive or
negative.
2. The numerical value of r, without regard to sign, indicates the strength of the
linear relationship.
A positive value of r reflects a tendency for pairs of scores to occupy similar
relative locations (high with high and low with low) in their respective
distributions, while a negative value of r reflects a tendency for pairs of scores
to occupy dissimilar relative locations (high with low and vice versa) in their
respective distributions.
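
The sketch below computes r from sums of squares and the sum of cross-products for the
greeting card example used later in these notes. Only the pairs for Doris (13, 14),
Steve (9, 18), and Mike (7, 12) appear in the notes; the remaining two pairs, (5, 10)
and (1, 6), are assumed values chosen to agree with the results quoted in the notes,
and the function name is hypothetical.

```python
from math import sqrt

# (cards sent X, cards received Y). Doris, Steve, and Mike come from the notes;
# the last two pairs are assumed values, chosen to agree with the notes' results.
pairs = [(13, 14), (9, 18), (7, 12), (5, 10), (1, 6)]

def pearson_r(pairs):
    """Pearson correlation from sums of squares and the sum of cross-products."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    ss_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    ss_y = sum((y - mean_y) ** 2 for _, y in pairs)
    sp_xy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    return sp_xy / sqrt(ss_x * ss_y)

r = pearson_r(pairs)
print(round(r, 2))   # positive sign: high with high, low with low
```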
TWO ROUGH PREDICTIONS
Predict “Relatively Large Number”:
 To obtain a slightly more precise prediction for Emma, refer to the scatter
plot for the original five friends shown in Figure 7.1. Notice that Emma’s
plan to send 11 cards locates her along the X axis between the 9 cards sent
by Steve and the 13 sent by Doris.
 Using the dots for Steve and Doris as guides, construct two strings of arrows,
one beginning at 9 and ending at 18 for Steve and the other beginning at 13
and ending at 14 for Doris.
 You could then predict that Emma’s return should be between 14 and 18 cards, the
numbers received by Doris and Steve.

A REGRESSION LINE
 All five dots contribute to the more precise prediction, illustrated in Figure
7.2, that Emma will receive 15.20 cards.
 If all five dots had defined a single straight line, placement of the regression
line would have been simple; merely let it pass through all dots.
 When the dots fail to define a single straight line, as in the scatterplot for the
five friends, placement of the regression line represents a compromise.
 It passes through the main cluster, possibly touching some dots but missing
others.

Predictive Errors

 The largest predictive error, shown as a broken vertical line, occurs for Steve,
who sent 9 cards. Although he actually received 18 cards, he should have received
slightly fewer than 14 cards, according to the regression line.
 The smallest predictive error (none whatsoever) occurs for Mike, who sent 7
cards. He actually received the 12 cards that he should have received, according
to the regression line.
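
The two errors described above can be checked with a small sketch. The slope 0.80 and
intercept 6.40 are assumptions here: they are not stated in these notes, but they
reproduce the predictions the notes do quote (15.20 cards for 11 sent, 12 cards for
Mike). The function name is hypothetical.

```python
def predict(x, b=0.80, a=6.40):
    """Predicted cards received for x cards sent (assumed slope and intercept)."""
    return b * x + a

# (cards sent, cards received) for the two friends discussed above
observed = {"Steve": (9, 18), "Mike": (7, 12)}

# Predictive error = actual Y minus predicted Y
errors = {name: y - predict(x) for name, (x, y) in observed.items()}

for name, error in errors.items():
    print(name, round(error, 2))
```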

LEAST SQUARES REGRESSION LINE:
 To avoid the arithmetic standoff of zero always produced by adding positive
and negative predictive errors (associated with errors above and below the
regression line, respectively), the placement of the regression line minimizes
not the total predictive error but the total squared predictive error, that is, the
total for all squared predictive errors.
 When located in this fashion, the regression line is often referred to as the
least squares regression line.

 In the least squares regression equation, Y′ = bX + a, Y′ represents the predicted
value (the predicted number of cards that will be received by any new friend, such
as Emma); X represents the known value (the known number of cards sent by any new
friend); and b and a represent numbers calculated from the original correlation
analysis.
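
A minimal sketch of how b and a might be calculated, using the slope formula
b = SPxy / SSx (equivalent to r multiplied by the square root of SSy / SSx) and
a = mean of Y − b × mean of X. Only three of the five data pairs appear in these notes;
the pairs (5, 10) and (1, 6) are assumed values chosen to agree with the quoted
prediction of 15.20 cards, and the function name is hypothetical.

```python
# (cards sent X, cards received Y). Doris, Steve, and Mike come from the notes;
# the last two pairs are assumed values.
pairs = [(13, 14), (9, 18), (7, 12), (5, 10), (1, 6)]

def least_squares_line(pairs):
    """Return slope b and intercept a that minimize the total squared predictive error."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    ss_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    sp_xy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    b = sp_xy / ss_x            # slope
    a = mean_y - b * mean_x     # Y intercept
    return b, a

b, a = least_squares_line(pairs)
emma = b * 11 + a               # Emma plans to send 11 cards

print(round(b, 2), round(a, 2), round(emma, 2))
```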

Least Squares Regression Equation: The equation that minimizes the total of all
squared prediction errors for known Y scores in the original correlation
analysis.
Referring to Table 6.3

STANDARD ERROR OF ESTIMATE, Sy|x:
 The standard error of estimate represents a special kind of standard
deviation that reflects the magnitude of predictive error.
 Although we predicted that Emma’s investment of 11 cards will yield a
return of 15.20 cards, we would be surprised if she actually received 15
cards.
 It is more likely that because of the imperfect relationship between cards sent
and cards received, Emma’s return will be some number other than 15.
 Although designed to minimize predictive error, the least squares equation
does not eliminate it.
 Therefore, our next task is to estimate the amount of error associated with
our predictions.
 The smaller the estimated error is, the better the prognosis will be for our
predictions.
You might find it helpful to think of the standard error of estimate, sy|x, as a
rough measure of the average amount of predictive error, that is, as a rough
measure of the average amount by which known Y values deviate from their
predicted Y′ values.
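
A rough sketch of the calculation, assuming the usual textbook form
sy|x = sqrt(SSy|x / (n − 2)), where SSy|x is the sum of squared predictive errors.
The slope 0.80 and intercept 6.40 are assumed (they reproduce the 15.20 prediction
quoted in these notes), and the data pairs (5, 10) and (1, 6) are assumed values not
shown in the notes. The function name is hypothetical.

```python
from math import sqrt

# (cards sent X, cards received Y); the last two pairs are assumed values.
pairs = [(13, 14), (9, 18), (7, 12), (5, 10), (1, 6)]

def standard_error_of_estimate(pairs, b, a):
    """Square root of the residual sum of squares divided by n - 2."""
    ss_residual = sum((y - (b * x + a)) ** 2 for x, y in pairs)
    return sqrt(ss_residual / (len(pairs) - 2))

# Assumed least squares coefficients for the greeting card example
s_yx = standard_error_of_estimate(pairs, b=0.80, a=6.40)

print(round(s_yx, 2))   # rough average predictive error, in cards
```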

Homoscedasticity:
 Use of the standard error of estimate, sy|x, assumes that, except for chance, the
dots in the original scatterplot will be dispersed equally about all segments of
the regression line.
 You need to worry about violating this assumption, known as homoscedasticity,
only when the scatterplot reveals a dramatically different type of dot cluster,
such as that shown in Figure 7.4.
 At the very least, the standard error of estimate for the data in Figure 7.4
should be used cautiously, since its value overestimates the variability of dots
about the lower half of the regression line and underestimates the variability of
dots about the upper half of the regression line.

INTERPRETATION OF r²:
 The squared correlation coefficient, r², provides us with not only a key
interpretation of the correlation coefficient but also a measure of predictive
accuracy that supplements the standard error of estimate, sy|x.
 It is the proportion of the total variability in one variable that is predictable
from its relationship with the other variable.
Repetitive Prediction of the Mean:
 Using the repetitive prediction of the mean, Ȳ, for each of the Y scores of all
five friends will supply us with a frame of reference against which to evaluate our
customary predictive effort based on the correlation between cards sent (X)
and cards received (Y).
 Any predictive effort that capitalizes on an existing correlation between X
and Y should be able to generate a smaller error variability (and, conversely,
more accurate predictions of Y) than a primitive effort based only on the
repetitive prediction of Ȳ.

Predictive Errors:
 Overall, as expected, errors are smaller when customized predictions of Y′
from the least squares equation can be used (because X scores are known)
than when only the repetitive prediction of Ȳ can be used (because X scores
are ignored).
 As with most statistical phenomena, there are exceptions: The predictive
error for Doris is slightly larger when the least squares equation is used.

Error Variability (Sum of Squares):

Repetitive Prediction of the Mean:
 For the sake of the present argument, pretend that we know the Y scores
(cards received), but not the corresponding X scores (cards sent), for each of
the five friends.
 Lacking information about the relationship between X and Y scores, we
could not construct a regression equation and use it to generate a customized
prediction, Y′, for each friend.
 We could, however, mount a primitive predictive effort by always predicting
the mean, Ȳ, for each of the five friends’ Y scores.

Proportion of Predicted Variability:
To obtain an SS measure of the actual gain in accuracy due to the least squares
predictions, subtract the residual variability from the total variability, that is,
subtract SSy|x from SSy, to obtain

SSy − SSy|x = 80 − 28.8 = 51.2

To express this difference, 51.2, as a gain in accuracy relative to the original
error variability for the repetitive prediction of Ȳ, divide the above difference
by SSy, that is,

(SSy − SSy|x) / SSy = 51.2 / 80 = .64

This result, .64 or 64 percent, represents the proportion or percent gain in
predictive accuracy when the repetitive prediction of Ȳ is replaced by a series of
customized Y′ predictions based on the least squares equation.
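
The same arithmetic can be sketched in Python. Only three of the five data pairs
appear in these notes; the pairs (5, 10) and (1, 6), and the least squares
coefficients 0.80 and 6.40, are assumptions chosen because they reproduce the
difference of 51.2 and the proportion of .64 quoted above.

```python
# (cards sent X, cards received Y); the last two pairs are assumed values.
pairs = [(13, 14), (9, 18), (7, 12), (5, 10), (1, 6)]
b, a = 0.80, 6.40   # assumed least squares slope and intercept

mean_y = sum(y for _, y in pairs) / len(pairs)

# Error variability when the mean of Y is predicted for every friend
ss_y = sum((y - mean_y) ** 2 for _, y in pairs)

# Error variability when customized least squares predictions are used
ss_y_given_x = sum((y - (b * x + a)) ** 2 for x, y in pairs)

gain = ss_y - ss_y_given_x   # predictable variability
r_squared = gain / ss_y      # proportion of predicted variability

print(round(ss_y, 1), round(ss_y_given_x, 1), round(gain, 1), round(r_squared, 2))
```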

r2 Interpretation:

MULTIPLE REGRESSION EQUATIONS:
A least squares equation that contains more than one predictor or X
variable.
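
As a sketch of the general form only (the notes give no specific equation), a multiple
regression equation with k predictors extends Y′ = bX + a as shown below. The subscripted
symbols are assumptions introduced here for illustration: each b_i is the least squares
weight for predictor X_i, and a is the intercept.

```latex
Y' = b_1 X_1 + b_2 X_2 + \cdots + b_k X_k + a
```

With k = 1 this reduces to the single-predictor least squares equation.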

REGRESSION TOWARD THE MEAN:


A tendency for scores, particularly extreme scores, to shrink toward the mean.
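
One way to see this tendency, as a minimal sketch: when both variables are expressed
as z scores, the least squares prediction is the predicted z score on Y = r × (z score
on X). Whenever r is less than 1 in absolute value, the predicted standing is closer
to the mean than the original standing. The value r = 0.80 and the z scores below are
assumed for illustration, and the function name is hypothetical.

```python
def predicted_z(z_x, r):
    """Predicted standing on Y, in z-score units, for a given standing on X."""
    return r * z_x

r = 0.80   # assumed correlation, for illustration only

# Extreme standings shrink the most in absolute terms.
for z_x in (2.0, -2.0, 0.5):
    print(z_x, round(predicted_z(z_x, r), 2))
```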
