4-2 B. Tech IT Regulation: R19 Data Science: UNIT-4

UNIT-4
Describing Data II
Syllabus:
Describing Data II: Normal distributions – z scores – normal curve problems –
finding proportions – finding scores – more about z scores – correlation – scatter
plots – correlation coefficient for quantitative data – computational formula for
correlation coefficient – regression – regression line – least squares regression line –
standard error of estimate – interpretation of r² – multiple regression equations –
regression toward the mean.
The Normal Distributions
 A Normal distribution (or Gaussian distribution) is a continuous
probability distribution that is symmetrical on both sides of the mean, so
that the right side of the center is a mirror image of the left side.
 The normal distribution is important because it accurately describes the
distribution of values for many natural phenomena.
 Many observed frequency distributions approximate the well-documented
normal curve, an important theoretical curve noted for its symmetrical
bell-shaped form.
 Characteristics that are the sum of many independent processes
frequently follow normal distributions. For example, heights, blood
pressure, measurement error, and IQ scores approximately follow a normal
distribution.
 The normal curve is defined in terms of standard deviation and mean.
 The normal curve can be used to obtain answers to a wide variety of
questions.

Properties of the Normal Curve:


Important properties of the normal curve are:
 The normal curve is a theoretical curve defined for a continuous
variable.
 The normal curve is symmetrical: its lower half is the mirror image of its
upper half.
 It is bell-shaped.
Prepared By: MD SHAKEEL AHMED, Associate Professor, Dept. Of IT, VVIT, Guntur Page 1
 The normal curve peaks above a point midway along the horizontal
spread and then tapers off gradually in either direction from the peak.
 The curve approaches the x-axis but never touches it as it extends
farther away from the mean in either direction.
 The values of the mean, median and mode, located at a point midway
along the horizontal spread, are the same for the normal curve.
 The total area under the curve equals 1.
 The normal curve has only one peak (i.e., it is unimodal).
Different Normal Curves
 When using the normal curve, two bits of information are indispensable:
values for the mean and the standard deviation
 Various types of normal curves are produced by an arbitrary change in
the value of either the mean (μ) or the standard deviation (σ)
 Every normal curve can be interpreted in exactly the same way once any
distance from the mean is expressed in standard deviation units

Z Scores
 A z score is a unit-free, standardized score that indicates how many
standard deviations a score lies above or below the mean of its distribution.
 To obtain a z score, express any original score, whether measured in
inches, milliseconds, dollars, IQ points, etc., as a deviation from its mean
(by subtracting its mean) and then split this deviation into standard
deviation units (by dividing by its standard deviation), that is,
z = (X − μ) / σ
where X is the original score and μ and σ are the mean and the standard
deviation, respectively, for the normal distribution of the original scores.

 A z score consists of two parts:


1. a positive or negative sign indicating whether it’s above or below
the mean; and
2. a number indicating the size of its deviation from the mean in
standard deviation units.
 Example: A z score of 2.00 always signifies that the original score is
exactly two standard deviations above its mean. Similarly, a z score of
–1.27 signifies that the original score is exactly 1.27 standard deviations
below its mean. A z score of 0 signifies that the original score coincides
with the mean.
 Problem: Express each of the following scores as a z score:
(a) Margaret’s IQ of 135, given a mean of 100 and a standard deviation
of 15
(b) a score of 470 on the SAT math test, given a mean of 500 and a
standard deviation of 100
(c) a daily production of 2100 loaves of bread by a bakery, given a
mean of 2180 and a standard deviation of 50
(d) Sam’s height of 69 inches, given a mean of 69 and a standard
deviation of 3
(e) a thermometer-reading error of –3 degrees, given a mean of 0 degrees
and a standard deviation of 2 degrees
Answers:
(a) z = (135-100)/15= 2.33
(b) z = (470-500)/100= -0.30
(c) z = (2100-2180)/50= -1.60
(d) z = (69-69)/3= 0.00
(e) z = (-3-0)/2= -1.50
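
The answers above can be checked with a few lines of Python. This is only a minimal sketch: the helper name z_score and the dictionary layout are illustrative choices, not part of the original notes.

```python
def z_score(x, mean, sd):
    """Return how many standard deviations x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# (score, mean, standard deviation) for parts (a) to (e) of the problem
cases = {
    "a": (135, 100, 15),
    "b": (470, 500, 100),
    "c": (2100, 2180, 50),
    "d": (69, 69, 3),
    "e": (-3, 0, 2),
}

z_scores = {label: round(z_score(*args), 2) for label, args in cases.items()}
for label, z in z_scores.items():
    print(f"({label}) z = {z:.2f}")
```

Note that part (b) comes out negative, because 470 lies below the mean of 500.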

STANDARD NORMAL CURVE


 If the original distribution approximates a normal curve, then the shift to
standard or z scores will always produce a new distribution that
approximates the standard normal curve.
 This is the one normal curve for which a table is actually available.
 The standard normal curve always has a mean of 0 and a standard
deviation of 1.
 Although there is an infinite number of different normal curves, each with its
own mean and standard deviation, there is only one standard normal
curve, with a mean of 0 and a standard deviation of 1.
 Converting all original observations into z scores leaves the normal shape
intact but not the units of measurement.

Standard Normal Table (Z Table)


 The standard normal table consists of columns of z scores coordinated
with columns of proportions.
 In a typical problem, access to the table is gained through a z score, such
as –1.00, and the answer is read as a proportion
 Table columns are arranged in sets of three, designated as A, B, and C in
the legend at the top of the table. When using the top legend, all entries
refer to the upper half of the standard normal curve.
 The entries in column A are z scores, beginning with 0.00 and ending
with 4.00. Given a z score of zero or more, column B indicates the proportion of
area between the mean and the z score, and column C indicates the
proportion of area beyond the z score, in the upper tail of the standard
normal curve.
 Because of the symmetry of the normal curve, the entries in table also can
refer to the lower half of the normal curve. Now the columns are
designated as A′, B′, and C′ in the legend at the bottom of the table. When
using the bottom legend, all entries refer to the lower half of the standard
normal curve.
 The nonzero entries in column A′ are negative z scores, beginning with
–0.01 and ending with –4.00.
 Column B′ indicates the proportion of area between the mean and the
negative z score, and column C′ indicates the proportion of area beyond
the negative z score, in the lower tail of the standard normal curve.
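
When a printed table is not at hand, the entries of columns B and C can be approximated in Python. The sketch below assumes the standard library class statistics.NormalDist (Python 3.8 or later); the function name table_columns is a hypothetical helper introduced only for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def table_columns(z):
    """Return (B, C): area between the mean and z, and area beyond z in the tail."""
    z = abs(z)  # by symmetry, a negative z gives the same areas in the lower half
    between_mean_and_z = standard_normal.cdf(z) - 0.5
    beyond_z = 1 - standard_normal.cdf(z)
    return between_mean_and_z, beyond_z

b, c = table_columns(-1.00)
print(f"B' = {b:.4f}, C' = {c:.4f}")  # B' = 0.3413, C' = 0.1587
```

The two proportions sum to .5000, as expected for one half of the curve.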

Normal Curve Problems


 There are two general types of normal curve problems:
(1) Finding proportions: these problems require finding the unknown
proportion (of area) associated with some score or pair of scores and
(2) Finding scores: these problems require finding the unknown score or
scores associated with some area.
 Answers to the first type of problem usually require converting original
scores into z scores and answers to the second type of problem usually
require translating a z score back into an original score.
 Rough graphs of normal curves can be used as an aid to visualizing the
solution. Only after thinking through to a solution, do any calculations
and consult the normal tables.

Fig: Interpretation of standard normal table


 When using the standard normal table, it is important to remember that
 For any z score, the corresponding proportions in columns B and
C (or columns B′ and C′) always sum to .5000.
 Similarly, the total area under the normal curve always equals
1.0000, the sum of the proportions in the lower and upper halves,
that is, .5000 + .5000.
 Finally, although a z score can be either positive or negative, the
proportions of area under the curve are always positive or zero but
never negative
Finding Proportions
 In these normal curve problems, the standard normal table (Table A) must be
consulted to find the unknown proportion (of area) associated with some
known score or pair of known scores.
Finding Proportions for One Score
 Step-by-step procedure:
1. Sketch a normal curve and shade in the target area
2. Plan solution according to the normal table.
3. Convert X to z using the formula z = (X − μ) / σ
4. Find the target area.
 Example: Find the proportion of all persons who are shorter than
exactly 66 inches, given that the distribution of heights approximates a
normal curve with a mean of 69 inches and a standard deviation of 3
inches.
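
The height example can be worked through in Python as a rough check. This sketch assumes statistics.NormalDist from the standard library in place of the printed table; the variable names are illustrative.

```python
from statistics import NormalDist

mean, sd = 69, 3          # heights in inches, from the example
z = (66 - mean) / sd      # step 3: convert X to z
proportion_shorter = NormalDist().cdf(z)  # step 4: area below z in the lower tail

print(f"z = {z:.2f}, proportion = {proportion_shorter:.4f}")  # z = -1.00, proportion = 0.1587
```

So roughly 16 percent of all persons are shorter than 66 inches under these assumptions.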

Finding Proportions between Two Scores


 Step-by-step procedure:
1. Sketch a normal curve and shade in the target area
2. Plan solution according to the normal table.
3. Convert X to z using the formula z = (X − μ) / σ
4. Find the target area.
 Example: Assume that, when not interrupted artificially, the gestation
periods for human foetuses approximate a normal curve with a mean of
270 days (9 months) and a standard deviation of 15 days. What
proportion of gestation periods will be between 245 and 255 days?
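
A possible way to check the gestation example in Python is sketched below, again assuming statistics.NormalDist rather than the printed table. Because the table rounds z to two decimals (–1.67) and the code does not, the result may differ from a table-based answer in the third or fourth decimal place.

```python
from statistics import NormalDist

mean, sd = 270, 15                 # gestation period in days, from the example
z_low = (245 - mean) / sd          # about -1.67
z_high = (255 - mean) / sd         # -1.00

std = NormalDist()
# Area between the two scores = area below the upper score minus area below the lower score
proportion_between = std.cdf(z_high) - std.cdf(z_low)

print(f"z = {z_low:.2f} and {z_high:.2f}, proportion = {proportion_between:.3f}")
```

The result is close to .11, that is, about 11 percent of gestation periods.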

Finding Proportions beyond Two Scores


 Step-by-step procedure:
1. Sketch a normal curve and shade in the two target areas
2. Plan your solution according to the normal table.
3. Convert X to z using the formula z = (X − μ) / σ
4. Find the target area.
 Problem: Assume that high school students’ IQ scores approximate a
normal distribution with a mean of 105 and a standard deviation of 15.
What proportion of IQs are more than 30 points either above or below the
mean?
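Answer:
Expressing IQ scores of 135 and 75 as z scores gives z = (135 − 105)/15 = 2.00
and z = (75 − 105)/15 = –2.00. Column C (or C′) lists a proportion of .0228
beyond each z score, so the answer is .0228 + .0228 = .0456.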
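
The two-tailed IQ problem can be sketched in Python as follows. The code assumes statistics.NormalDist instead of the printed table, so the result may differ slightly from a table-based answer because the table entries are rounded.

```python
from statistics import NormalDist

mean, sd = 105, 15
z_upper = (135 - mean) / sd    # 2.00
z_lower = (75 - mean) / sd     # -2.00

std = NormalDist()
upper_tail = 1 - std.cdf(z_upper)   # area beyond +2.00
lower_tail = std.cdf(z_lower)       # area beyond -2.00
proportion_beyond = upper_tail + lower_tail

print(f"proportion beyond both scores = {proportion_beyond:.4f}")
```

Both tails contribute equally because the curve is symmetrical, giving a total of about .0455.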

Finding Scores
 In this type of normal curve problem, the standard normal table (Table A)
must be consulted to find the unknown score or scores associated with
some known proportion.
 Essentially, this type of problem requires using Table A by
entering proportions in columns B, C, B′, or C′ and finding z scores listed
in columns A or A′.
Finding One Score
 Step-by-step procedure:
1. Sketch a normal curve and, on the correct side of the mean, draw a
line representing the target score
2. Plan your solution according to the normal table.
3. Find z.
4. Convert z to the target score using the formula X = μ + (z)(σ)

 Problem: Exam scores for a large psychology class approximate a normal
curve with a mean of 230 and a standard deviation of 50. Furthermore,
students are graded “on a curve,” with only the upper 20 percent being
awarded grades of A. What is the lowest score on the exam that receives
an A?
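
One way to work this problem in Python is sketched below. It assumes the inverse cumulative distribution function NormalDist.inv_cdf from the standard library as a stand-in for reading the table backwards; with the table, the closest z score is 0.84.

```python
from statistics import NormalDist

mean, sd = 230, 50
upper_proportion = 0.20

# z score that leaves 20% of the area in the upper tail (80% below it)
z = NormalDist().inv_cdf(1 - upper_proportion)

# Step 4: convert z back to an exam score, X = mean + z * sd
lowest_a = mean + z * sd

print(f"z = {z:.2f}, lowest A score is about {lowest_a:.0f}")  # z = 0.84, about 272
```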

Finding Two Scores


 Step-by-step procedure:
1. Sketch a normal curve. On either side of the mean, draw two lines
representing the two target scores
2. Plan your solution according to the normal table.
3. Find z.
4. Convert z to the target score, using the formula X = μ + (z)(σ)
 Problem: Assume that the annual rainfall in the San Francisco area
approximates a normal curve with a mean of 22 inches and a standard
deviation of 4 inches. What are the rainfalls for the more atypical years,
defined as the driest 2.5 percent of all years and the wettest 2.5 percent of
all years?
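
The rainfall problem can be sketched in Python as follows, assuming NormalDist.inv_cdf in place of the table lookup (the table gives z = ±1.96 for a tail area of .0250).

```python
from statistics import NormalDist

mean, sd = 22, 4          # annual rainfall in inches
tail = 0.025              # driest 2.5% and wettest 2.5% of years

z = NormalDist().inv_cdf(1 - tail)   # about 1.96; the lower cutoff uses -z

driest_cutoff = mean + (-z) * sd
wettest_cutoff = mean + z * sd

print(f"{driest_cutoff:.2f}, {wettest_cutoff:.2f}")  # 14.16, 29.84
```

Years with less than about 14.16 inches or more than about 29.84 inches of rain would count as atypical under these assumptions.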

More About Z Scores


Z Scores for Non-normal Distributions
 z scores are not limited to normal distributions.
 Non-normal distributions also can be transformed into sets of unit-free,
standardized z scores.
 In this case, the standard normal table cannot be consulted, since the
shape of the distribution of z scores is the same as that for the original
non-normal distribution.
 Regardless of the shape of the distribution, the shift to z scores always
produces a distribution of standard scores with a mean of 0 and a standard
deviation of 1.
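
The last point can be illustrated with a small Python sketch. The scores below are hypothetical, chosen only to be positively skewed, and the population standard deviation (statistics.pstdev) is assumed when standardizing.

```python
from statistics import mean, pstdev

# Hypothetical, positively skewed scores (not taken from the notes)
scores = [1, 1, 2, 2, 3, 4, 15]

mu = mean(scores)
sigma = pstdev(scores)   # population standard deviation
z_scores = [(x - mu) / sigma for x in scores]

# The z scores have mean 0 and standard deviation 1 (up to rounding error),
# but the distribution is still skewed: the largest z is far from the mean.
print(f"mean = {mean(z_scores):.4f}, sd = {pstdev(z_scores):.4f}")
print(f"smallest z = {min(z_scores):.2f}, largest z = {max(z_scores):.2f}")
```

Because the skew survives the transformation, proportions from the standard normal table would not apply to these z scores.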

 Z scores can provide efficient descriptions of relative performance on one
or more tests.
 The use of z scores can help to identify a person’s relative strengths and
weaknesses on several different tests.

 For example, the table above shows Sharon’s scores on college achievement
tests in three different subjects. The evaluation of her test performance is
greatly facilitated by converting her raw scores into the z scores listed in
the final column of the table. A glance at the z scores suggests that
although she did relatively well on the math test, her performance on the
English test was only slightly above average, as indicated by a z score of
0.50, and her performance on the psychology test was slightly below
average, as indicated by a z score of –0.67.
Standard Score
 Any unit-free score expressed relative to a known mean and a known
standard deviation is called a standard score.
 Although z scores qualify as standard scores because they are unit-free
and expressed relative to a known mean of 0 and a known standard
deviation of 1, other scores also qualify as standard scores.
Transformed Standard Scores
 z scores can be changed to transformed standard scores, other types of
unit-free standard scores that lack negative signs and decimal points.
 These transformations change neither the shape of the original
distribution nor the relative standing of any test score within the
distribution.
 For example, a test score located one standard deviation below the mean
might be reported not as a z score of –1.00 but as a T score of 40 in a
distribution of T scores with a mean of 50 and a standard deviation of 10.
 The following figure shows the values of some of the more common types of
transformed standard scores relative to the various portions of the area
under the normal curve.

Converting to Transformed Standard Scores


 The following formula can be used to convert any original standard score, z,
into a transformed standard score, z′, having a distribution with any
desired mean and standard deviation.
z′ = desired mean + (z) (desired standard deviation)
where z′ (called z prime) is the transformed standard score and z is the
original standard score.
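
The conversion formula can be sketched in Python as shown below. The function name transform is a hypothetical helper; the three target scales (50 and 10, 100 and 15, 500 and 100) are the ones named in the problem that follows.

```python
def transform(z, desired_mean, desired_sd):
    """Convert a z score into a transformed standard score z'."""
    return desired_mean + z * desired_sd

z = -1.00  # one standard deviation below the mean

t_score = transform(z, 50, 10)       # mean 50, standard deviation 10
iq_style = transform(z, 100, 15)     # mean 100, standard deviation 15
sat_style = transform(z, 500, 100)   # mean 500, standard deviation 100

print(t_score, iq_style, sat_style)  # 40.0 85.0 400.0
```

The same relative standing (one standard deviation below the mean) is reported as 40, 85, or 400 depending on the scale chosen.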
 Problem: Assume that each of the raw scores listed originates from a
distribution with the specified mean and standard deviation. After
converting each raw score into a z score, transform each z score into a
series of new standard scores with means and standard deviations of 50
and 10, 100 and 15, and 500 and 100, respectively.

Answers:

Correlation
 Two variables are related if pairs of scores show an orderliness that can
be depicted graphically with a scatter plot and numerically with a
correlation coefficient.
 The data in following table represent a very simple observational study
with two dependent variables.

Three Types of Relationships (Types of correlation)


 Positive Relationship
 Negative Relationship
 Little or No Relationship
Positive Relationship
 Two variables are positively related if pairs of scores tend to occupy
similar relative positions (relatively low values are paired with relatively
low values, and relatively high values are paired with relatively high
values) in their respective distributions.
 Example: (Height, Weight)
(Temperature, Ice cream sales)
Negative Relationship
 Two variables are negatively related if pairs of scores tend to occupy
dissimilar relative positions (relatively low values are paired with
relatively high values, and relatively high values are paired with
relatively low values) in their respective distributions.
 Example: (Exercise, Body Fat)
(Watching Movies, Exam scores)
Little or No Relationship
 No regularity is apparent among the pairs of scores
 Example: (Shoe Size, Movies Watched)
(Coffee Consumption, Intelligence)
Describing relationship between pairs of variables


 There are two efficient and exact statistical techniques for
describing the relationship between two variables, namely, a special graph
known as a scatter plot and a measure known as a correlation coefficient.
Scatter Plots
 A scatter plot is a graph containing a cluster of dots that represents all
pairs of scores.
 We can use any dot cluster as a preview of a fully measured relationship.
Construction
 To construct a scatter plot, scale each of the two variables along the
horizontal (X) and vertical (Y) axes, and use each pair of scores to locate
a dot within the scatter plot.

Categorizing relationship using scatter plot


( Positive, Negative, or Little or No Relationship?)
 A dot cluster that has a slope from the lower left to the upper right
reflects a positive relationship. Small values of one variable are paired
with small values of the other variable, and large values are paired with
large values.
 Example: In panel A of the figure below, short people tend to be light, and
tall people tend to be heavy.
 A dot cluster that has a slope from the upper left to the lower right
reflects a negative relationship. Small values of one variable tend to be
paired with large values of the other variable, and vice versa.
 Example: In panel B of the figure below, people who have smoked heavily
for only a few years or not at all tend to have longer lives, and people who have
smoked heavily for many years tend to have shorter lives.
 A dot cluster that lacks any apparent slope reflects little or no
relationship. Small values of one variable are just as likely to be paired
with small, medium, or large values of the other variable.
 Example: In panel C of the figure below, notice that the dots are strewn about
in an irregular shotgun fashion, suggesting that there is little or no
relationship between the height of young adults and their life
expectancies.

Perfect Relationship
 A dot cluster that equals (rather than merely approximates) a straight line
reflects a perfect relationship between two variables.
Linear Relationship
 A relationship that can be described best with a straight line.
Curvilinear Relationship
 A relationship that can be described best with a curved line.

A Correlation Coefficient For Quantitative Data : r


 A correlation coefficient is a number between –1 and 1 that describes
the relationship between pairs of variables.
 The Pearson correlation coefficient, designated as r, describes the
linear relationship between pairs of variables for quantitative data. It
can equal any value between –1.00 and +1.00.

 Furthermore, the following two properties apply:


1. The sign of r indicates the type of linear relationship, whether positive
or negative.
2. The numerical value of r, without regard to sign, indicates the strength
of the linear relationship.
 A number with a plus sign (or no sign) indicates a positive relationship,
and a number with a minus sign indicates a negative relationship. For
example, an r with a plus sign describes the positive relationship between
height and weight, and an r with a minus sign describes the negative
relationship between heavy smoking and life expectancy.
 The more closely a value of r approaches either –1.00 or +1.00, the
stronger (more regular) the relationship. Conversely, the more closely the
value of r approaches 0, the weaker (less regular) the relationship.
 For example, an r of –.90 indicates a stronger relationship than does an r
of –.70, and an r of –.70 indicates a stronger relationship than does an r of
.50
 A correlation coefficient, regardless of size, never provides information
about whether an observed relationship reflects a simple cause-effect
relationship or some more complex state of affairs.
Computation Formula for Correlation Coefficient
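 The correlation coefficient can be calculated by using the following
computation formula:
r = SPxy / √(SSx · SSy)
where the two sum of squares terms in the denominator are defined as
SSx = ΣX² − (ΣX)² / n
SSy = ΣY² − (ΣY)² / n
and the sum of the products term in the numerator, SPxy, is defined as
SPxy = ΣXY − (ΣX)(ΣY) / n
where n is the number of pairs of scores.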
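
The computation formula can be sketched in Python as shown below. The sketch assumes the usual definitions SSx = ΣX² − (ΣX)²/n, SSy = ΣY² − (ΣY)²/n and SPxy = ΣXY − (ΣX)(ΣY)/n, and it borrows the study-hours and sleeping-hours data from Tutorial Question 9 of this unit; the function name pearson_r is illustrative.

```python
from math import sqrt

# Paired scores from Tutorial Question 9 in this unit
study_hours = [2, 4, 6, 8, 10]   # X
sleep_hours = [10, 9, 8, 7, 6]   # Y

def pearson_r(x, y):
    """Pearson r via the computation formula r = SPxy / sqrt(SSx * SSy)."""
    n = len(x)
    ss_x = sum(v * v for v in x) - sum(x) ** 2 / n
    ss_y = sum(v * v for v in y) - sum(y) ** 2 / n
    sp_xy = sum(p * q for p, q in zip(x, y)) - sum(x) * sum(y) / n
    return sp_xy / sqrt(ss_x * ss_y)

r = pearson_r(study_hours, sleep_hours)
print(f"r = {r:.2f}")  # r = -1.00
```

For these data every extra two hours of study goes with exactly one hour less sleep, so the dots fall on a straight line and r equals –1.00, a perfect negative relationship.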

Problem: Couples who attend a clinic for first pregnancies are asked to
estimate (independently of each other) the ideal number of children. Given that
X and Y represent the estimates of females and males, respectively, the results
are as follows:

Calculate a value for r, using the computation formula


Answer:

Regression
 A regression is a statistical technique that relates a dependent variable to
one or more independent (explanatory) variables.
 A regression model is able to show whether changes observed in the
dependent variable are associated with changes in one or more of the
explanatory variables.
 Regression captures the correlation between variables observed in a data
set, and quantifies whether those correlations are statistically significant
or not.
Regression Line
 A regression line is a line that best describes the behaviour of a set of
data. In other words, it is the line that best fits the trend of a given data set.
 The purpose of the line is to describe the relationship between a dependent
variable (Y variable) and one or more independent variables (X
variables).
 By using the equation obtained from the regression line an analyst can
forecast future behaviours of the dependent variable by inputting different
values for the independent ones.

Types of regression
The two basic types of regression are
 Simple linear regression: Simple linear regression uses one
independent variable to explain or predict the outcome of the
dependent variable Y
 Multiple linear regression: Multiple linear regression uses two or
more independent variables to predict the outcome

Predictive Errors
 Prediction error refers to the difference between the predicted values
made by some model and the actual values.

Least Squares Regression Line


 The placement of the regression line minimizes not the total predictive
error but the total squared predictive error, that is, the total for all squared
predictive errors. When located in this fashion, the regression line is often
referred to as the least squares regression line.
 The Least Squares Regression Line is the line that minimizes the sum of
the squared residuals. A residual is the vertical distance between the
observed point and the predicted point, and it is calculated by subtracting
ŷ from y.
 Least Squares Regression Equation: an equation that pinpoints the exact least
squares regression line for any scatter plot. Most generally, this equation
reads:
Y′ = bX + a
where Y′ represents the predicted value,
X represents the known value, and
b and a represent numbers calculated from the original correlation
analysis, described by
b = r √(SSy / SSx)
a = Ȳ − bX̄
where Ȳ and X̄ are the sample means of Y and X, and SSy and SSx are the
sums of squares for Y and X.
 The regression equation can be used to predict the Y′ value for a given X
value by simply substituting the X value into the equation.
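
A minimal Python sketch of fitting and using the least squares equation follows. It assumes the usual expressions b = r·√(SSy/SSx) and a = mean(Y) − b·mean(X), and it reuses the study-hours and sleeping-hours data from Tutorial Question 9; the function names least_squares and predict are hypothetical.

```python
from math import sqrt
from statistics import mean

study_hours = [2, 4, 6, 8, 10]   # X, from Tutorial Question 9
sleep_hours = [10, 9, 8, 7, 6]   # Y

def least_squares(x, y):
    """Return slope b and intercept a of the least squares line Y' = bX + a."""
    n = len(x)
    ss_x = sum(v * v for v in x) - sum(x) ** 2 / n
    ss_y = sum(v * v for v in y) - sum(y) ** 2 / n
    sp_xy = sum(p * q for p, q in zip(x, y)) - sum(x) * sum(y) / n
    r = sp_xy / sqrt(ss_x * ss_y)
    b = r * sqrt(ss_y / ss_x)
    a = mean(y) - b * mean(x)
    return b, a

b, a = least_squares(study_hours, sleep_hours)

def predict(x):
    """Predicted Y' for a known X."""
    return b * x + a

print(f"Y' = {b:.2f}X + {a:.2f}")   # Y' = -0.50X + 11.00
print(predict(5))                    # 8.5
```

Substituting X = 5 study hours into the fitted equation predicts 8.5 sleeping hours for these data.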

 Problem: Assume that an r of .30 describes the relationship between
educational level (highest grade completed) and estimated number of
hours spent reading each week. More specifically:

(a) Determine the least squares equation for predicting weekly reading
time from educational level.
(b) Faith’s education level is 15. What is her predicted reading time?
(c) Keegan’s educational level is 11. What is his predicted reading time?

Answer:

Standard Error of Estimate (Sy/x)


 The standard error of the estimate is a measure of the accuracy of
predictions.
 The regression line is the line that minimizes the sum of squared
deviations of prediction (also called the sum of squares error), and the
standard error of the estimate is the square root of the average squared
deviation.
 The standard error of estimate represents a special kind of standard
deviation that reflects the magnitude of predictive error.
 It is a rough measure of the average amount of predictive error, that is,
a rough measure of the average amount by which known Y values
deviate from their predicted Y′ values.
 This estimate of predictive error complies with the general format for any
sample standard deviation, that is, the square root of a sum of squares
term divided by its degrees of freedom:
Sy/x = √( SSy/x / (n − 2) ) = √( Σ(Y − Y′)² / (n − 2) )
 Although we can estimate the overall predictive error by dealing directly with
predictive errors, Y − Y′, it is more efficient to use the following
computation formula:
Sy/x = √( SSy (1 − r²) / (n − 2) )
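
The two routes to the standard error of estimate can be compared in a short Python sketch. It assumes n − 2 degrees of freedom and the computation formula Sy/x = √(SSy(1 − r²)/(n − 2)); the five pairs of scores are hypothetical and serve only to illustrate the arithmetic.

```python
from math import sqrt
from statistics import mean

# Hypothetical paired scores (not taken from the notes)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

ss_x = sum(v * v for v in x) - sum(x) ** 2 / n                     # 10
ss_y = sum(v * v for v in y) - sum(y) ** 2 / n                     # 6
sp_xy = sum(p * q for p, q in zip(x, y)) - sum(x) * sum(y) / n     # 6

r = sp_xy / sqrt(ss_x * ss_y)
b = r * sqrt(ss_y / ss_x)          # slope of the least squares line
a = mean(y) - b * mean(x)          # intercept

# Route 1: work directly with the predictive errors Y - Y'
ss_error = sum((yi - (b * xi + a)) ** 2 for xi, yi in zip(x, y))
s_direct = sqrt(ss_error / (n - 2))

# Route 2: computation formula based on SSy and r
s_formula = sqrt(ss_y * (1 - r ** 2) / (n - 2))

print(f"direct = {s_direct:.4f}, formula = {s_formula:.4f}")  # both 0.8944
```

Both routes give the same value, which is why the computation formula can be used without calculating each predictive error.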

Interpretation of r²
 The squared correlation coefficient, r², provides a measure of predictive
accuracy that supplements the standard error of estimate, Sy/x.
 r² indicates the proportion of total variability in one variable that is
predictable from its relationship with the other variable.
 It is a statistical measure in a regression model that determines the
proportion of variance in the dependent variable that can be explained by
the independent variable. In other words, r-squared shows how well the
data fit the regression model (the goodness of fit).
 r-squared can take any value between 0 and 1. Although this statistical
measure provides some useful insights regarding the regression model,
the user should not rely only on this measure in the assessment of a
statistical model.
 In addition, it does not indicate the correctness of the regression model.
Therefore, the user should always draw conclusions about the model by
analyzing r-squared together with the other variables in a statistical
model.
 Expressing the equation for r² in symbols, we have:
r² = (SSy − SSy/x) / SSy
Example: Suppose
SSy = 80 and SSy/x = 28.8
then
r² = (80 − 28.8) / 80 = 51.2 / 80 = .64

Multiple Regression Equations


 Serious predictive efforts usually involve multiple regression equations
composed of more than one predictor, or X, variable.
 Most generally, these equations take the form:
Y′ = b1(X1) + b2(X2) + b3(X3) + a
where Y′ is the predicted value of the dependent variable and X1, X2, and X3
are independent (predictor or X) variables.
 For instance, a serious effort to predict college GPA might culminate in
the following equation: Y′ = .410(X1) + .005(X2) + .001(X3) + 1.03
where Y′ represents predicted college GPA and X1, X2, and X3 refer to
high school GPA, IQ score, and SAT score, respectively.
 By capitalizing on the combined predictive power of several predictor
variables, these multiple regression equations supply more accurate
predictions for Y′ (often referred to as the criterion variable) than could
be obtained from a simple regression equation.
 These multiple regression equations share many common features with
the simple regression equations.
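
The GPA equation quoted above can be evaluated in Python as a quick illustration. The coefficients come from the equation in the notes; the student values (high school GPA 3.0, IQ 110, SAT 1000) are hypothetical, and the function name predict_gpa is an illustrative choice.

```python
def predict_gpa(hs_gpa, iq, sat):
    """Predicted college GPA: Y' = .410(X1) + .005(X2) + .001(X3) + 1.03."""
    return 0.410 * hs_gpa + 0.005 * iq + 0.001 * sat + 1.03

# Hypothetical student, used only to show the substitution
predicted = predict_gpa(hs_gpa=3.0, iq=110, sat=1000)
print(f"predicted college GPA = {predicted:.2f}")  # 3.81
```

Each predictor contributes its own weighted term, and the constant 1.03 plays the same role as the intercept a in a simple regression equation.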
Regression toward the Mean
 Regression toward the mean refers to a tendency for scores, particularly
extreme scores, to shrink toward the mean.
 This tendency often appears among subsets of observations whose values
are extreme and at least partly due to chance.
 Regression toward the mean refers to the principle that, over repeated
sampling periods, outliers tend to revert to the mean. High performers
show disappointing results when they fail to continue delivering;
strugglers show sudden improvement.
 Regression toward the mean occurs when the correlation between two
measures is imperfect, and so one data point cannot predict the next data
point reliably.
 In other words, when we ignore regression toward the mean,
we overestimate the correlation between the two measures.
 For example, because of regression toward the mean, we would expect
that students who made the top five scores on the first mid exam would
not make the top five scores on the second mid exam. Although all five
students might score above the mean on the second mid exam, some of
their scores would regress back toward the mean.
 Example 2: A military commander has two units return, one with 20%
casualties and another with 50% casualties. He praises the first and
berates the second. The next time, the two units return with the opposite
results. From this experience, he “learns” that praise weakens
performance and berating increases performance.
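
Regression toward the mean can be demonstrated with a small simulation. The sketch below is built on stated assumptions: each hypothetical student has a stable ability plus independent chance variation on each of two exams, and all numbers are generated rather than observed.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the run is repeatable
n = 10_000

# Assumed model: score = stable ability + chance component (both standard normal)
ability = [random.gauss(0, 1) for _ in range(n)]
exam1 = [a + random.gauss(0, 1) for a in ability]
exam2 = [a + random.gauss(0, 1) for a in ability]

# Select the top 5 percent of students on the first exam
cutoff = sorted(exam1)[int(0.95 * n)]
top = [i for i in range(n) if exam1[i] >= cutoff]

top_mean_exam1 = mean(exam1[i] for i in top)
top_mean_exam2 = mean(exam2[i] for i in top)

print(f"top group, exam 1 mean: {top_mean_exam1:.2f}")
print(f"top group, exam 2 mean: {top_mean_exam2:.2f}")
print(f"all students, exam 2 mean: {mean(exam2):.2f}")
```

The top group still scores above average on the second exam, but its mean moves back toward the overall mean even though nothing was done to these students between exams.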

The Regression Fallacy


 The regression fallacy is committed whenever regression toward the
mean is interpreted as a real, rather than a chance, effect.
 If misinterpreted as a real effect, regression toward the mean can lead to
erroneous conclusions called regression fallacy.
 The regression fallacy occurs when one mistakes regression to the
mean for a causal relationship. For example, if a tall father were to
conclude that his tall wife committed adultery because their children were
shorter, he would be committing the regression fallacy.
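 The regression fallacy can be avoided by splitting the subset of extreme
observations into two groups, only one of which receives the treatment. If
both groups shift toward the mean by the same amount, the shift reflects
chance rather than a real effect.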
Tutorial Questions:
1. What is normal curve? List out the properties of normal curve
2. Explain in detail about z scores
3. Outline standard normal curve and standard normal table
4. Explain in detail about finding proportions and finding scores.
(or) What are two types of normal curve problems? How to answer these
problems
5. Explain in detail about z scores for non-normal distribution
6. Discuss the three types of relationships with examples. How are these
types of relationships categorized using a scatter plot and a correlation coefficient?
7. Highlight the significance of correlation coefficient? Outline the
procedure for finding correlation coefficient using computational formula
with example and corresponding python program.
8. Explain the significance of regression line and least square regression line
with examples.
9. Calculate and analyze the correlation coefficient between the number of
study hours and the number of sleeping hours of different students.
Number of study Hours 2 4 6 8 10
Number of Sleeping Hours 10 9 8 7 6
10. How standard error of estimate is calculated
11. What is significance of r2? Give a detailed interpretation of r2?
12. Elucidate regression towards the mean with example. Explain regression
fallacy and state how it can be avoided.
13. Discuss scatter plot with example and corresponding python program.
How to interpret scatter plot.
14. Each of the following pairs represents the number of licensed drivers (X)
and the number of cars (Y) for seven houses in my neighborhood:


i) Calculate the correlation coefficient using the computation formula.
ii) Determine the least squares equation for these data.
iii) Determine the standard error of estimate.
iv) Predict the number of cars for each of two new families with two
and five drivers.
v) Compute r².
Assignment Questions:
1. Express each of the following scores as a z score:
(a) Margaret’s IQ of 135, given a mean of 100 and a standard deviation of 15
(b) a score of 470 on the SAT math test, given a mean of 500 and a standard
deviation of 100
(c) a daily production of 2100 loaves of bread by a bakery, given a mean of
2180 and a standard deviation of 50
(d) Sam’s height of 69 inches, given a mean of 69 and a standard deviation of 3
(e) a thermometer-reading error of –3 degrees, given a mean of 0 degrees and
a standard deviation of 2 degrees
2. Find the proportion of the total area identified with the following statements:
(a) above a z score of 1.80
(b) between the mean and a z score of –0.43
(c) below a z score of –3.00
(d) between the mean and a z score of 1.65
(e) between z scores of 0 and –1.96
3. Assume that GRE scores approximate a normal curve with a mean of 500 and a
standard deviation of 100. Find the proportions that correspond to the target
area described by each of the following statements:
(a) less than 400
(b) more than 650
(c) less than 700
4. Assume that SAT math scores approximate a normal curve with a mean of 500
and a standard deviation of 100. Find the target area(s) described by each of
the following statements:
(a) more than 570
(b) less than 515
(c) between 520 and 540
(d) between 470 and 520
(e) more than 50 points above the mean
(f) more than 100 points either above or below the mean

5. For the normal distribution of burning times of electric light bulbs, with a mean
equal to 1200 hours and a standard deviation equal to 120 hours, what burning
time is identified with the
(a) upper 50 percent?
(b) lower 75 percent?
(c) lower 1 percent?
(d) middle 90 percent?
6. Assume that each of the raw scores listed originates from a distribution with
the specified mean and standard deviation. After converting each raw score
into a z score, transform each z score into a series of new standard scores with
means and standard deviations of 50 and 10, 100 and 15, and 500 and 100,
respectively.

7. Indicate whether the following statements suggest a positive or negative
relationship:
(a) More densely populated areas have higher crime rates.
(b) Schoolchildren who often watch TV perform more poorly on academic
achievement tests.
(c) Heavier automobiles yield poorer gas mileage.
(d) Better-educated people have higher incomes.
(e) More anxious people voluntarily spend more time performing a simple
repetitive task.
8. Couples who attend a clinic for first pregnancies are asked to estimate
(independently of each other) the ideal number of children. Given that X and Y
represent the estimates of females and males, respectively, the results are as
follows:

Calculate a value for correlation coefficient r, using the computation formula

9. Each of the following pairs represents the number of licensed drivers (X) and
the number of cars (Y) for seven houses in my neighborhood:


(a) Calculate a value for correlation coefficient r, using the computation
formula.
(b) Determine the least squares equation for these data.
(c) Determine the standard error of estimate.
(d) Predict the number of cars for each of two new families with two and five
drivers.
(e) Determine the square of the correlation coefficient, r².

10. Consider the following data

(a) Calculate a value for correlation coefficient r, using the computation
formula.
(b) Determine the least squares equation for these data.
(c) Determine the standard error of estimate.
(d) Determine the square of the correlation coefficient.


Standard Normal Table (Table A)
