4-2 B. Tech IT Regulation: R19 Data Science: UNIT-4

UNIT-4
Describing Data II
Syllabus:
Describing Data II: Normal distributions – z scores – normal curve problems –
finding proportions – finding scores – more about z scores – correlation – scatter
plots – correlation coefficient for quantitative data – computational formula for
correlation coefficient – regression – regression line – least squares regression line –
standard error of estimate – interpretation of r² – multiple regression equations –
regression toward the mean.
The Normal Distributions
 A normal distribution (or Gaussian distribution) is a continuous
probability distribution that is symmetrical about its mean, so that the
right side of the center is a mirror image of the left side.
 The normal distribution is important because it closely approximates the
distribution of values for many natural phenomena.
 Many observed frequency distributions approximate the well-documented
normal curve, an important theoretical curve noted for its symmetrical
bell-shaped form.
 Characteristics that are the sum of many independent processes
frequently follow normal distributions. For example, heights, blood
pressure, measurement error, and IQ scores approximately follow the
normal distribution.
 The normal curve is defined in terms of standard deviation and mean.
 The normal curve can be used to obtain answers to a wide variety of
questions.

Properties of the Normal Curve:

Important properties of the normal curve are:
 The normal curve is a theoretical curve defined for a continuous
variable.
 The normal curve is symmetrical: its lower half is the mirror image of its
upper half.
 It is bell-shaped.
Prepared By: MD SHAKEEL AHMED, Associate Professor, Dept. Of IT, VVIT, Guntur Page 1
 The normal curve peaks above a point midway along the horizontal
spread and then tapers off gradually in either direction from the peak.
 The curve approaches the x-axis as it extends farther from the mean, but
it never touches the axis.
 The values of the mean, median and mode, located at a point midway
along the horizontal spread, are the same for the normal curve.
 The total area under the curve equals 1.
 The normal curve has only one peak (i.e., it is unimodal).
Different Normal Curves
 When using the normal curve, two bits of information are indispensable:
values for the mean and the standard deviation
 Various types of normal curves are produced by an arbitrary change in
the value of either the mean (μ) or the standard deviation (σ)
 Every normal curve can be interpreted in exactly the same way once any
distance from the mean is expressed in standard deviation units

Z Scores
 A z score is a unit-free, standardized score that indicates how many
standard deviations a score is above or below the mean of its distribution.
 To obtain a z score, express any original score, whether measured in
inches, milliseconds, dollars, IQ points, etc., as a deviation from its mean
(by subtracting its mean) and then split this deviation into standard
deviation units (by dividing by its standard deviation), that is,

z = (X − μ) / σ

where X is the original score and μ and σ are the mean and the standard
deviation, respectively, for the normal distribution of the original scores.

 A z score consists of two parts:

1. a positive or negative sign indicating whether it’s above or below
the mean; and
2. a number indicating the size of its deviation from the mean in
standard deviation units.
 Example: A z score of 2.00 always signifies that the original score is
exactly two standard deviations above its mean. Similarly, a z score of –
1.27 signifies that the original score is exactly 1.27 standard deviations
below its mean. A z score of 0 signifies that the original score coincides
with the mean.
 Problem: Express each of the following scores as a z score:
(a) Margaret’s IQ of 135, given a mean of 100 and a standard deviation
of 15
(b) a score of 470 on the SAT math test, given a mean of 500 and a
standard deviation of 100
(c) a daily production of 2100 loaves of bread by a bakery, given a
mean of 2180 and a standard deviation of 50
(d) Sam’s height of 69 inches, given a mean of 69 and a standard
deviation of 3
(e) a thermometer-reading error of –3 degrees, given a mean of 0 degrees
and a standard deviation of 2 degrees
Answers:
(a) z = (135-100)/15= 2.33
(b) z = (470-500)/100= -0.30
(c) z = (2100-2180)/50= -1.60
(d) z = (69-69)/3= 0.00
(e) z = (-3-0)/2= -1.50
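
The same conversions can be checked with a few lines of Python. This is only a sketch: the function name z_score and the dictionary layout are choices made here for illustration, while the scores, means, and standard deviations are those given in the problem.

```python
def z_score(x, mean, sd):
    """Return how many standard deviations x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# (score, mean, standard deviation) for parts (a) to (e) of the problem
cases = {
    "a": (135, 100, 15),
    "b": (470, 500, 100),
    "c": (2100, 2180, 50),
    "d": (69, 69, 3),
    "e": (-3, 0, 2),
}

z_values = {part: round(z_score(*args), 2) for part, args in cases.items()}
for part, z in z_values.items():
    print(f"({part}) z = {z:.2f}")
```

A score below its mean, such as the SAT score of 470 in part (b), always produces a negative z score.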

STANDARD NORMAL CURVE

 If the original distribution approximates a normal curve, then the shift to
standard or z scores will always produce a new distribution that
approximates the standard normal curve.
 This is the one normal curve for which a table is actually available.
 The standard normal curve always has a mean of 0 and a standard
deviation of 1.
 Although there is an infinite number of different normal curves, each with
its own mean and standard deviation, there is only one standard normal
curve, with a mean of 0 and a standard deviation of 1.
 Converting all original observations into z scores leaves the normal shape
intact but not the units of measurement.

Standard Normal Table (Z Table)

 The standard normal table consists of columns of z scores coordinated
with columns of proportions.
 In a typical problem, access to the table is gained through a z score, such
as –1.00, and the answer is read as a proportion
 Table columns are arranged in sets of three, designated as A, B, and C in
the legend at the top of the table. When using the top legend, all entries
refer to the upper half of the standard normal curve.
 The entries in column A are z scores, beginning with 0.00 and ending
with 4.00. Given a z score of zero or more, column B indicates the
proportion of area between the mean and the z score, and column C
indicates the proportion of area beyond the z score, in the upper tail of
the standard normal curve.
 Because of the symmetry of the normal curve, the entries in table also can
refer to the lower half of the normal curve. Now the columns are
designated as A′, B′, and C′ in the legend at the bottom of the table. When
using the bottom legend, all entries refer to the lower half of the standard
normal curve.
 The nonzero entries in column A′ are negative z scores, beginning with
–0.01 and ending with –4.00.
 Column B′ indicates the proportion of area between the mean and the
negative z score, and column C′ indicates the proportion of area beyond
the negative z score, in the lower tail of the standard normal curve.
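
When a printed table is not at hand, the column B and column C proportions can be computed directly. The sketch below assumes Python 3.8 or later, where the standard library provides statistics.NormalDist; the helper name table_columns is made up for this illustration.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist(mu=0, sigma=1)

def table_columns(z):
    """Return (B, C) for a z score, using the table's conventions.

    B: proportion of area between the mean and z.
    C: proportion of area beyond z, in the tail on the same side as z.
    By symmetry, a negative z gives the same proportions as its positive twin.
    """
    upper = STANDARD_NORMAL.cdf(abs(z))
    return upper - 0.5, 1.0 - upper

b, c = table_columns(-1.00)
print(f"B' = {b:.4f}, C' = {c:.4f}")  # B' = 0.3413, C' = 0.1587
```

For any z score the two returned proportions add up to .5000, which is half of the total area under the curve.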

Normal Curve Problems

 There are two general types of normal curve problems:
(1) Finding proportions: these problems require finding the unknown
proportion (of area) associated with some score or pair of scores and
(2) Finding scores: these problems require finding the unknown score or
scores associated with some area.
 Answers to the first type of problem usually require converting original
scores into z scores and answers to the second type of problem usually
require translating a z score back into an original score.
 Rough graphs of normal curves can be used as an aid to visualizing the
solution. Do any calculations and consult the normal table only after
thinking through to a solution.

Fig: Interpretation of standard normal table

 When using the standard normal table, it is important to remember that
 For any z score, the corresponding proportions in columns B and
C (or columns B′ and C′) always sum to .5000.
 Similarly, the total area under the normal curve always equals
1.0000, the sum of the proportions in the lower and upper halves,
that is, .5000 + .5000.
 Finally, although a z score can be either positive or negative, the
proportions of area under the curve are always positive or zero but
never negative
Finding Proportions
 In these normal curve problems, the standard normal table (table A) must
be consulted to find the unknown proportion (of area) associated with some
known score or pair of known scores.
Finding Proportions for One Score
 Step-by-step procedure:
1. Sketch a normal curve and shade in the target area
2. Plan solution according to the normal table.
3. Convert X to z using the formula z = (X − μ) / σ
4. Find the target area.
 Example: to find the proportion of all persons who are shorter than
exactly 66 inches, given that the distribution of heights approximates a
normal curve with a mean of 69 inches and a standard deviation of 3
inches.
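
The height example can be worked through in Python as a check on the table lookup. This is a sketch that assumes Python 3.8 or later (for statistics.NormalDist); the variable names are illustrative only.

```python
from statistics import NormalDist

heights = NormalDist(mu=69, sigma=3)   # inches, from the example

z = (66 - heights.mean) / heights.stdev   # step 3: convert X to z
proportion_shorter = heights.cdf(66)      # step 4: area below X = 66

print(f"z = {z:.2f}, proportion = {proportion_shorter:.4f}")  # z = -1.00, proportion = 0.1587
```

The result agrees with the column C′ entry for a z score of –1.00, so about 16 percent of persons are shorter than 66 inches.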

Finding Proportions between Two Scores

 Step-by-step procedure:
1. Sketch a normal curve and shade in the target area
2. Plan solution according to the normal table.
3. Convert X to z using the formula z = (X − μ) / σ
4. Find the target area.
 Example: Assume that, when not interrupted artificially, the gestation
periods for human foetuses approximate a normal curve with a mean of
270 days (9 months) and a standard deviation of 15 days. What
proportion of gestation periods will be between 245 and 255 days?
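
A minimal Python sketch of the gestation example follows, assuming Python 3.8 or later for statistics.NormalDist. The area between two scores is found by subtracting the smaller cumulative area from the larger one.

```python
from statistics import NormalDist

gestation = NormalDist(mu=270, sigma=15)   # days

z_low = (245 - 270) / 15    # about -1.67
z_high = (255 - 270) / 15   # -1.00

# Area between the two scores = area below 255 minus area below 245.
proportion_between = gestation.cdf(255) - gestation.cdf(245)
print(f"{proportion_between:.4f}")
```

The computed proportion is about .11. Working from the table with z rounded to –1.67 gives .1587 – .0475 = .1112, which differs slightly only because of that rounding.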

Finding Proportions beyond Two Scores

 Step-by-step procedure:
1. Sketch a normal curve and shade in the two target areas
2. Plan your solution according to the normal table.
3. Convert X to z using the formula z = (X − μ) / σ
4. Find the target area.
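 Problem: Assume that high school students’ IQ scores approximate a
normal distribution with a mean of 105 and a standard deviation of 15.
What proportion of IQs are more than 30 points either above or below the
mean?
Answer:
Expressing IQ scores of 135 and 75 as z scores gives
z = (135 − 105)/15 = 2.00 and z = (75 − 105)/15 = –2.00.
The proportion beyond a z score of 2.00 in either tail is .0228, so the
proportion beyond both scores is .0228 + .0228 = .0456.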
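
The two-tail calculation for this problem can be sketched in Python, assuming Python 3.8 or later for statistics.NormalDist. The two target areas are computed separately and then added.

```python
from statistics import NormalDist

iq = NormalDist(mu=105, sigma=15)

upper_tail = 1.0 - iq.cdf(135)   # area above z = +2.00
lower_tail = iq.cdf(75)          # area below z = -2.00
proportion_beyond = upper_tail + lower_tail

print(f"{proportion_beyond:.4f}")  # 0.0455
```

The table-based value is 2 × .0228 = .0456; the small difference arises because each table entry is already rounded to four places.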

Finding Scores
 In this type of normal curve problem, the standard normal table (table A)
must be consulted to find the unknown score or scores associated with
some known proportion.
 Essentially, this type of problem requires using table A in reverse: enter
a proportion in column B, C, B′, or C′ and read the corresponding z score
from column A or A′.
Finding One Score
 Step-by-step procedure:
1. Sketch a normal curve and, on the correct side of the mean, draw a
line representing the target score
2. Plan your solution according to the normal table.
3. Find z.
4. Convert z to the target score using the formula X = μ + (z)(σ)

 Problem: Exam scores for a large psychology class approximate a normal
curve with a mean of 230 and a standard deviation of 50. Furthermore,
students are graded “on a curve,” with only the upper 20 percent being
awarded grades of A. What is the lowest score on the exam that receives
an A?
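
One way to check a hand solution to this problem is the sketch below, which assumes Python 3.8 or later for statistics.NormalDist. Its inv_cdf method plays the role of reading the table in reverse (from a proportion to a z score).

```python
from statistics import NormalDist

exam = NormalDist(mu=230, sigma=50)

# The upper 20 percent begins where 80 percent of the area lies below.
z = NormalDist().inv_cdf(0.80)          # step 3: find z (about 0.84)
cutoff = exam.mean + z * exam.stdev     # step 4: X = mean + (z)(sd)

print(f"z = {z:.2f}, lowest A score = {cutoff:.1f}")
```

With the table value z = 0.84, the formula gives X = 230 + (0.84)(50) = 272, which matches the computed cutoff to the nearest whole score.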

Finding Two Scores

 Step-by-step procedure:
1. Sketch a normal curve. On either side of the mean, draw two lines
representing the two target scores
2. Plan your solution according to the normal table.
3. Find z.
4. Convert z to the target score, using the formula X = μ + (z)(σ)
 Problem: Assume that the annual rainfall in the San Francisco area
approximates a normal curve with a mean of 22 inches and a standard
deviation of 4 inches. What are the rainfalls for the more atypical years,
defined as the driest 2.5 percent of all years and the wettest 2.5 percent of
all years?
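
The rainfall problem can be sketched in Python as follows, assuming Python 3.8 or later for statistics.NormalDist. Each cutoff is the score with the stated proportion of area below it.

```python
from statistics import NormalDist

rainfall = NormalDist(mu=22, sigma=4)   # inches per year

driest_cutoff = rainfall.inv_cdf(0.025)    # 2.5 percent of years fall below this
wettest_cutoff = rainfall.inv_cdf(0.975)   # 2.5 percent of years fall above this

print(f"driest: {driest_cutoff:.2f} in, wettest: {wettest_cutoff:.2f} in")
# driest: 14.16 in, wettest: 29.84 in
```

These values correspond to z scores of –1.96 and +1.96, that is, X = 22 + (–1.96)(4) = 14.16 and X = 22 + (1.96)(4) = 29.84.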

More About Z Scores

Z Scores for Non-normal Distributions
 z scores are not limited to normal distributions.
 Non-normal distributions also can be transformed into sets of unit-free,
standardized z scores.
 In this case, the standard normal table cannot be consulted, since the
shape of the distribution of z scores is the same as that for the original
non-normal distribution.
 Regardless of the shape of the distribution, the shift to z scores always
produces a distribution of standard scores with a mean of 0 and a standard
deviation of 1.
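
The last point can be illustrated with a short sketch. The scores below are invented purely for illustration and are deliberately skewed; the population standard deviation (statistics.pstdev) is assumed as σ.

```python
from statistics import fmean, pstdev

# A made-up, clearly skewed (non-normal) set of scores, for illustration only.
scores = [1, 1, 2, 2, 2, 3, 3, 4, 6, 16]

mean, sd = fmean(scores), pstdev(scores)
z_scores = [(x - mean) / sd for x in scores]

# Mean of the z scores is 0 and their standard deviation is 1 (up to rounding).
print(f"{abs(fmean(z_scores)):.4f} {pstdev(z_scores):.4f}")
```

The z scores keep the skew of the original scores (the extreme score of 16 is still far out in the upper tail), which is why the standard normal table does not apply here.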

 Z scores can provide efficient descriptions of relative performance on one
or more tests.
 The use of z scores can help to identify a person’s relative strengths and
weaknesses on several different tests.

 For example, the table above shows Sharon’s scores on college achievement
tests in three different subjects. The evaluation of her test performance is
greatly facilitated by converting her raw scores into the z scores listed in
the final column of the table. A glance at the z scores suggests that
although she did relatively well on the math test, her performance on the
English test was only slightly above average, as indicated by a z score of
0.50, and her performance on the psychology test was slightly below
average, as indicated by a z score of –0.67.
Standard Score
 Any unit-free score expressed relative to a known mean and a known
standard deviation is called a standard score.
 Although z scores qualify as standard scores because they are unit-free
and expressed relative to a known mean of 0 and a known standard
deviation of 1, other scores also qualify as standard scores.
Transformed Standard Scores
 z scores can be changed to transformed standard scores, other types of
unit-free standard scores that lack negative signs and decimal points.
 These transformations change neither the shape of the original
distribution nor the relative standing of any test score within the
distribution.
 For example, a test score located one standard deviation below the mean
might be reported not as a z score of –1.00 but as a T score of 40 in a
distribution of T scores with a mean of 50 and a standard deviation of 10.
 The following figure shows the values of some of the more common types
of transformed standard scores relative to the various portions of the area
under the normal curve.

Converting to Transformed Standard Scores

 The following formula can be used to convert any original standard score,
z, into a transformed standard score, z′, having a distribution with any
desired mean and standard deviation.
z′ = desired mean + (z) (desired standard deviation)
where z′ (called z prime) is the transformed standard score and z is the
original standard score.
 Problem: Assume that each of the raw scores listed originates from a
distribution with the specified mean and standard deviation. After
converting each raw score into a z score, transform each z score into a
series of new standard scores with means and standard deviations of 50
and 10, 100 and 15, and 500 and 100, respectively.

Answers:
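
The raw scores for this problem are not reproduced here, so the sketch below only illustrates the conversion formula itself. It uses the z score of –1.00 from the T score example earlier in this section and the three target scales named in the problem; the function name transform is an illustrative choice.

```python
def transform(z, desired_mean, desired_sd):
    """z' = desired mean + (z)(desired standard deviation)."""
    return desired_mean + z * desired_sd

# The three target scales named in the problem: (mean, standard deviation).
scales = [(50, 10), (100, 15), (500, 100)]

z = -1.00   # one standard deviation below the mean
transformed = [transform(z, m, s) for m, s in scales]
print(transformed)   # [40.0, 85.0, 400.0]
```

The first value, 40, is the T score quoted earlier for a score located one standard deviation below the mean.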

Correlation
 Two variables are related if pairs of scores show an orderliness that can
be depicted graphically with a scatter plot and numerically with a
correlation coefficient.
 The data in the following table represent a very simple observational
study with two dependent variables.

Three Types of Relationships (Types of correlation)

 Positive Relationship
 Negative Relationship
 Little or No Relationship
Positive Relationship
 Two variables are positively related if pairs of scores tend to occupy
similar relative positions (relatively low values are paired with relatively
low values, and relatively high values are paired with relatively high
values) in their respective distributions.
 Example: (Height, Weight)
(Temperature, Ice cream sales)
Negative Relationship
 Two variables are negatively related if pairs of scores tend to occupy
dissimilar relative positions (relatively low values are paired with
relatively high values, and relatively high values are paired with
relatively low values) in their respective distributions.
 Example: (Exercise, Body Fat)
(Watching Movies, Exam scores)
Little or No Relationship
 No regularity is apparent among the pairs of scores
 Example: (Shoe Size, Movies Watched)
(Coffee Consumption, Intelligence)
Describing relationship between pairs of variables

 There are two more efficient and exact statistical techniques for
describing relationship between two variables, namely, a special graph
known as a scatter plot and a measure known as a correlation coefficient.
Scatter Plots
 A scatter plot is a graph containing a cluster of dots that represents all
pairs of scores.
 We can use any dot cluster as a preview of a fully measured relationship.
Construction
 To construct a scatter plot, scale each of the two variables along the
horizontal (X) and vertical (Y) axes, and use each pair of scores to locate
a dot within the scatter plot.

Categorizing relationship using scatter plot

(Positive, Negative, or Little or No Relationship?)
 A dot cluster that has a slope from the lower left to the upper right
reflects a positive relationship. Small values of one variable are paired
with small values of the other variable, and large values are paired with
large values.
 Example: In panel A of the figure below, short people tend to be light, and
tall people tend to be heavy.
 A dot cluster that has a slope from the upper left to the lower right
reflects a negative relationship. Small values of one variable tend to be
paired with large values of the other variable, and vice versa.
 Example: In panel B of the figure below, people who have smoked heavily
for only a few years or not at all tend to have longer lives, and people who
have smoked heavily for many years tend to have shorter lives.
 A dot cluster that lacks any apparent slope reflects little or no
relationship. Small values of one variable are just as likely to be paired
with small, medium, or large values of the other variable.
 Example: In panel C of the figure below, notice that the dots are strewn
about in an irregular shotgun fashion, suggesting that there is little or no
relationship between the height of young adults and their life
expectancies.

Perfect Relationship
 A dot cluster that equals (rather than merely approximates) a straight line
reflects a perfect relationship between two variables.
Linear Relationship
 A relationship that can be described best with a straight line.
Curvilinear Relationship
 A relationship that can be described best with a curved line.

A Correlation Coefficient For Quantitative Data : r

 A correlation coefficient is a number between –1 and 1 that describes
the relationship between pairs of variables.
 The Pearson correlation coefficient, designated as r, describes the
linear relationship between pairs of variables for quantitative data. It can
equal any value between –1.00 and +1.00.

 Furthermore, the following two properties apply:

1. The sign of r indicates the type of linear relationship, whether positive
or negative.
2. The numerical value of r, without regard to sign, indicates the strength
of the linear relationship.
 A number with a plus sign (or no sign) indicates a positive relationship,
and a number with a minus sign indicates a negative relationship. For
example, an r with a plus sign describes the positive relationship between
height and weight, and an r with a minus sign describes the negative
relationship between heavy smoking and life expectancy.
 The more closely a value of r approaches either –1.00 or +1.00, the
stronger (more regular) the relationship. Conversely, the more closely the
value of r approaches 0, the weaker (less regular) the relationship.
 For example, an r of –.90 indicates a stronger relationship than does an r
of –.70, and an r of –.70 indicates a stronger relationship than does an r of
.50
 A correlation coefficient, regardless of size, never provides information
about whether an observed relationship reflects a simple cause-effect
relationship or some more complex state of affairs.
Computation Formula for Correlation Coefficient
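 The correlation coefficient can be calculated using the following
computation formula:

r = SPxy / √(SSx · SSy)

where the two sum of squares terms in the denominator are defined as

SSx = Σ(X − X̄)² = ΣX² − (ΣX)² / n
SSy = Σ(Y − Ȳ)² = ΣY² − (ΣY)² / n

and the sum of the products term in the numerator, SPxy, is defined as

SPxy = Σ(X − X̄)(Y − Ȳ) = ΣXY − (ΣX)(ΣY) / n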
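
A minimal Python sketch of this computation follows. It assumes the usual computation formulas SSx = ΣX² − (ΣX)²/n, SSy = ΣY² − (ΣY)²/n, and SPxy = ΣXY − (ΣX)(ΣY)/n, and it uses the study-hours and sleeping-hours data from tutorial question 9 of these notes; the function name pearson_r is an illustrative choice.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson r from the computation formulas for SSx, SSy, and SPxy."""
    n = len(xs)
    ss_x = sum(x * x for x in xs) - sum(xs) ** 2 / n
    ss_y = sum(y * y for y in ys) - sum(ys) ** 2 / n
    sp_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    return sp_xy / sqrt(ss_x * ss_y)

study_hours = [2, 4, 6, 8, 10]
sleep_hours = [10, 9, 8, 7, 6]

r = pearson_r(study_hours, sleep_hours)
print(f"r = {r:.2f}")   # r = -1.00
```

For these data SSx = 40, SSy = 10, and SPxy = –20, so r = –20 / √400 = –1.00: a perfect negative relationship, since every extra two hours of study is paired with exactly one hour less sleep.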

Problem: Couples who attend a clinic for first pregnancies are asked to
estimate (independently of each other) the ideal number of children. Given that
X and Y represent the estimates of females and males, respectively, the results
are as follows:

Calculate a value for r, using the computation formula

Answer:

Regression
 A regression is a statistical technique that relates a dependent variable to
one or more independent (explanatory) variables.
 A regression model is able to show whether changes observed in the
dependent variable are associated with changes in one or more of the
explanatory variables.
 Regression captures the correlation between variables observed in a data
set, and quantifies whether those correlations are statistically significant
or not.
Regression Line
 A regression line is a line that best describes the behaviour of a set of
data. In other words, it’s a line that best fits the trend of a given data.
 The purpose of the line is to describe the interrelation of a dependent
variable (Y variable) with one or many independent variables (X
variable).
 By using the equation obtained from the regression line an analyst can
forecast future behaviours of the dependent variable by inputting different
values for the independent ones.

Types of regression
The two basic types of regression are
 Simple linear regression: uses one independent variable to explain
or predict the outcome of the dependent variable Y.
 Multiple linear regression: uses two or more independent variables
to predict the outcome.

Predictive Errors
 Prediction error refers to the difference between the predicted values
made by some model and the actual values.

Least Squares Regression Line

 The placement of the regression line minimizes not the total predictive
error but the total squared predictive error, that is, the total for all squared
predictive errors. When located in this fashion, the regression line is often
referred to as the least squares regression line.
 The least squares regression line is the line that minimizes the sum of
the squared residuals. A residual is the vertical distance between an
observed point and the corresponding predicted point, and it is calculated
by subtracting the predicted value Y′ from the observed value Y.
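 Least squares regression equation: an equation that pinpoints the exact
least squares regression line for any scatter plot. Most generally, this
equation reads:
Y′ = bX + a
where Y′ represents the predicted value,
X represents the known value, and
b and a represent numbers calculated from the original correlation
analysis, described by

b = r √(SSy / SSx)
a = Ȳ − bX̄

where r is the correlation coefficient, SSx and SSy are the sums of squares
for X and Y, and X̄ and Ȳ are the sample means.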

 The regression equation can be used to predict the Y′ value for a given X
value by simply substituting the X value into the equation.
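
The sketch below shows one way to obtain b and a and then make a prediction. It assumes the standard expressions b = r√(SSy/SSx) and a = Ȳ − bX̄, and it reuses the study-hours and sleeping-hours data from tutorial question 9 of these notes; the function name least_squares_line is an illustrative choice.

```python
from math import sqrt
from statistics import fmean

def least_squares_line(xs, ys):
    """Return (b, a) for Y' = bX + a, with b = r*sqrt(SSy/SSx) and a = mean(Y) - b*mean(X)."""
    mx, my = fmean(xs), fmean(ys)
    ss_x = sum((x - mx) ** 2 for x in xs)
    ss_y = sum((y - my) ** 2 for y in ys)
    sp_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    r = sp_xy / sqrt(ss_x * ss_y)
    b = r * sqrt(ss_y / ss_x)
    a = my - b * mx
    return b, a

study_hours = [2, 4, 6, 8, 10]
sleep_hours = [10, 9, 8, 7, 6]

b, a = least_squares_line(study_hours, sleep_hours)
print(f"Y' = {b:.2f}X + {a:.2f}")   # Y' = -0.50X + 11.00

predicted_sleep = b * 5 + a          # prediction for 5 study hours
print(f"{predicted_sleep:.2f}")      # 8.50
```

Because these data fall exactly on a straight line, every prediction is free of error; with real data the predictions only approximate the observed Y values.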

 Problem: Assume that an r of .30 describes the relationship between
educational level (highest grade completed) and estimated number of
hours spent reading each week. More specifically:

(a) Determine the least squares equation for predicting weekly reading
time from educational level.
(b) Faith’s education level is 15. What is her predicted reading time?
(c) Keegan’s educational level is 11. What is his predicted reading time?

Answer:

Standard Error of Estimate (Sy|x)

 The standard error of the estimate is a measure of the accuracy of
predictions.
 The regression line is the line that minimizes the sum of squared
deviations of prediction (also called the sum of squares error), and the
standard error of the estimate is the square root of the average squared
deviation.
 The standard error of estimate represents a special kind of standard
deviation that reflects the magnitude of predictive error.
 It is a rough measure of the average amount of predictive error, that is,
a rough measure of the average amount by which known Y values
deviate from their predicted Y′ values.
 This estimate of predictive error complies with the general format for any
sample standard deviation, that is, the square root of a sum of squares
term divided by its degrees of freedom:

Sy|x = √( SSy|x / (n − 2) ) = √( Σ(Y − Y′)² / (n − 2) )

 Although we can estimate the overall predictive error by dealing directly
with predictive errors, Y − Y′, it is more efficient to use the following
computation formula:

Sy|x = √( SSy(1 − r²) / (n − 2) )
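
The sketch below compares the two routes to the standard error of estimate: summing the squared predictive errors directly, and using the computation formula √(SSy(1 − r²)/(n − 2)). The five pairs of scores are hypothetical, invented only to illustrate the arithmetic.

```python
from math import sqrt
from statistics import fmean

# Hypothetical paired scores, invented only to illustrate the arithmetic.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]
n = len(xs)

mx, my = fmean(xs), fmean(ys)
ss_x = sum((x - mx) ** 2 for x in xs)
ss_y = sum((y - my) ** 2 for y in ys)
sp_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
r = sp_xy / sqrt(ss_x * ss_y)
b = r * sqrt(ss_y / ss_x)
a = my - b * mx

# Definition: square root of the summed squared errors over n - 2.
ss_error = sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))
s_direct = sqrt(ss_error / (n - 2))

# Computation formula: uses only SSy, r, and n.
s_formula = sqrt(ss_y * (1 - r ** 2) / (n - 2))

print(f"{s_direct:.4f} {s_formula:.4f}")
```

Here r = .80, SSy = 10, and the summed squared error is 3.6, so both routes give √(3.6/3), about 1.10.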

Interpretation of r²
 The squared correlation coefficient, r², provides a measure of predictive
accuracy that supplements the standard error of estimate, Sy|x.
 r² indicates the proportion of total variability in one variable that is
predictable from its relationship with the other variable.
 It is a statistical measure in a regression model that determines the
proportion of variance in the dependent variable that can be explained by
the independent variable. In other words, r-squared shows how well the
data fit the regression model (the goodness of fit).
 r-squared can take any value between 0 and 1. Although it provides
useful insight into the regression model, the user should not rely on this
measure alone when assessing a statistical model.
 In addition, it does not indicate the correctness of the regression model.
Therefore, the user should always draw conclusions about the model by
analyzing r-squared together with the other information available about
the model.
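 Expressing the equation for r² in symbols, we have:

r² = SSy′ / SSy = (SSy − SSy|x) / SSy

where SSy is the total variability in Y, SSy|x is the variability in Y that is
not predictable from X, and SSy′ is the variability that is predictable.

Example: Suppose
SSy = 80 and SSy|x = 28.8
then

r² = (80 − 28.8) / 80 = 51.2 / 80 = .64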
Multiple Regression Equations

 Serious predictive efforts usually involve multiple regression equations
composed of more than one predictor, or X, variable.
 Most generally, these equations take the form:
Y′ = b1(X1) + b2(X2) + b3(X3) + a
where Y′ is the predicted value of the dependent variable and X1, X2, and
X3 are independent (predictor, or X) variables.
 For instance, a serious effort to predict college GPA might culminate in
the following equation: Y′ = .410(X1) + .005(X2) + .001(X3) + 1.03
where Y′ represents predicted college GPA and X1, X2, and X3 refer to
high school GPA, IQ score, and SAT score, respectively.
 By capitalizing on the combined predictive power of several predictor
variables, these multiple regression equations supply more accurate
predictions for Y′ (often referred to as the criterion variable) than could
be obtained from a simple regression equation.
 These multiple regression equations share many common features with
the simple regression equations.
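
As an illustration, the example GPA equation from the text can be evaluated in Python. The applicant's values below are hypothetical, invented only to show the substitution; the function name predicted_gpa is likewise an illustrative choice.

```python
def predicted_gpa(hs_gpa, iq, sat):
    """Y' = .410(X1) + .005(X2) + .001(X3) + 1.03, the example equation from the text."""
    return 0.410 * hs_gpa + 0.005 * iq + 0.001 * sat + 1.03

# Hypothetical applicant, invented for illustration.
gpa = round(predicted_gpa(hs_gpa=3.0, iq=100, sat=1000), 2)
print(gpa)   # 3.76
```

Each predictor contributes its own term (1.23 + 0.50 + 1.00) before the constant 1.03 is added, just as the single term bX does in a simple regression equation.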
Regression toward the Mean
 Regression toward the mean refers to a tendency for scores, particularly
extreme scores, to shrink toward the mean.
 This tendency often appears among subsets of observations whose values
are extreme and at least partly due to chance.
 Regression toward the mean refers to the principle that, over repeated
sampling periods, outliers tend to revert to the mean. High performers
show disappointing results when they fail to continue delivering;
strugglers show sudden improvement.
 Regression toward the mean occurs when the correlation between two
measures is imperfect, and so one data point cannot predict the next data
point reliably.
 In other words, when we ignore regression toward the mean,
we overestimate the correlation between the two measures.
 For example, because of regression toward the mean, we would expect
that students who made the top five scores on the first mid exam would
not make the top five scores on the second mid exam. Although all five
students might score above the mean on the second mid exam, some of
their scores would regress back toward the mean.
 Example 2: A military commander has two units return, one with 20%
casualties and another with 50% casualties. He praises the first and
berates the second. The next time, the two units return with the opposite
results. From this experience, he “learns” that praise weakens
performance and berating increases performance.
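
Regression toward the mean can be demonstrated with a small simulation. The sketch below rests on an assumed model that is not part of these notes: each exam score is a stable ability plus independent luck, with all means and standard deviations chosen arbitrarily for illustration.

```python
import random
from statistics import fmean

random.seed(0)   # fixed seed so the run is repeatable

# Assumed model: each exam score = stable ability + independent luck.
N = 10_000
ability = [random.gauss(100, 10) for _ in range(N)]
exam1 = [a + random.gauss(0, 10) for a in ability]
exam2 = [a + random.gauss(0, 10) for a in ability]

# Select the top 5 percent on the first exam, then follow them to the second.
top = sorted(range(N), key=lambda i: exam1[i], reverse=True)[: N // 20]
top_exam1 = fmean(exam1[i] for i in top)
top_exam2 = fmean(exam2[i] for i in top)
overall_exam2 = fmean(exam2)

print(f"top group, exam 1: {top_exam1:.1f}")
print(f"top group, exam 2: {top_exam2:.1f}")
print(f"everyone,  exam 2: {overall_exam2:.1f}")
```

The top group still scores above the overall mean on the second exam, but noticeably lower than on the first, even though nothing was done to them between exams.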

The Regression Fallacy

 The regression fallacy is committed whenever regression toward the
mean is interpreted as a real, rather than a chance, effect.
 If misinterpreted as a real effect, regression toward the mean can lead to
erroneous conclusions called regression fallacy.
 The Regression Fallacy occurs when one mistakes regression to the
mean, for a causal relationship. For example, if a tall father were to
conclude that his tall wife committed adultery because their children were
shorter, he would be committing the regression fallacy.
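 The regression fallacy can be avoided by splitting the subset of extreme
observations into two groups: one group receives the treatment being
evaluated and the other (a control group) does not. Because both groups
regress toward the mean, only a difference between them points to a real
effect.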
Tutorial Questions:
1. What is normal curve? List out the properties of normal curve
2. Explain in detail about z scores
3. Outline standard normal curve and standard normal table
4. Explain in detail about finding proportions and finding scores.
(or) What are two types of normal curve problems? How to answer these
problems
5. Explain in detail about z scores for non-normal distribution
6. Discuss the three types of relationships with examples. How are these
types of relationships categorized using a scatter plot and a correlation
coefficient?
7. Highlight the significance of correlation coefficient? Outline the
procedure for finding correlation coefficient using computational formula
with example and corresponding python program.
8. Explain the significance of regression line and least square regression line
with examples.
9. Calculate and analyze the correlation coefficient between the number of
study hours and the number of sleeping hours of different students.
Number of Study Hours       2    4    6    8    10
Number of Sleeping Hours    10   9    8    7    6
10. How is the standard error of estimate calculated?
11. What is the significance of r²? Give a detailed interpretation of r².
12. Explain regression toward the mean with an example. Explain the
regression fallacy and state how it can be avoided.
13. Discuss the scatter plot with an example and the corresponding Python
program. How is a scatter plot interpreted?
14. Each of the following pairs represents the number of licensed drivers (X)
and the number of cars (Y) for seven houses in my neighborhood:


i) Calculate the correlation coefficient using the computation formula.
ii) Determine the least squares equation for these data.
iii) Determine the standard error of estimate.
iv) Predict the number of cars for each of two new families with two
and five drivers.
v) Compute r².
Assignment Questions:
1. Express each of the following scores as a z score:
(a) Margaret’s IQ of 135, given a mean of 100 and a standard deviation of 15
(b) a score of 470 on the SAT math test, given a mean of 500 and a standard
deviation of 100
(c) a daily production of 2100 loaves of bread by a bakery, given a mean of
2180 and a standard deviation of 50
(d) Sam’s height of 69 inches, given a mean of 69 and a standard deviation of 3
(e) a thermometer-reading error of –3 degrees, given a mean of 0 degrees and
a standard deviation of 2 degrees
2. Find the proportion of the total area identified with the following statements:
(a) above a z score of 1.80
(b) between the mean and a z score of –0.43
(c) below a z score of –3.00
(d) between the mean and a z score of 1.65
(e) between z scores of 0 and –1.96
3. Assume that GRE scores approximate a normal curve with a mean of 500 and a
standard deviation of 100. Find the proportions that correspond to the target
area described by each of the following statements:
(a) less than 400
(b) more than 650
(c) less than 700
4. Assume that SAT math scores approximate a normal curve with a mean of 500
and a standard deviation of 100. Find the target area(s) described by each of the
following statements:
(a) more than 570
(b) less than 515
(c) between 520 and 540
(d) between 470 and 520
(e) more than 50 points above the mean
(f) more than 100 points either above or below the mean

5. For the normal distribution of burning times of electric light bulbs, with a mean
equal to 1200 hours and a standard deviation equal to 120 hours, what burning
time is identified with the
(a) upper 50 percent?
(b) lower 75 percent?
(c) lower 1 percent?
(d) middle 90 percent?
6. Assume that each of the raw scores listed originates from a distribution with
the specified mean and standard deviation. After converting each raw score
into a z score, transform each z score into a series of new standard scores with
means and standard deviations of 50 and 10, 100 and 15, and 500 and 100,
respectively

7. Indicate whether the following statements suggest a positive or negative
relationship:
(a) More densely populated areas have higher crime rates.
(b) Schoolchildren who often watch TV perform more poorly on academic
achievement tests.
(c) Heavier automobiles yield poorer gas mileage.
(d) Better-educated people have higher incomes.
(e) More anxious people voluntarily spend more time performing a simple
repetitive task.
8. Couples who attend a clinic for first pregnancies are asked to estimate
(independently of each other) the ideal number of children. Given that X and Y
represent the estimates of females and males, respectively, the results are as
follows:

Calculate a value for the correlation coefficient r, using the computation formula.

9. Each of the following pairs represents the number of licensed drivers (X) and
the number of cars (Y) for seven houses in my neighborhood:


(a) Calculate a value for the correlation coefficient r, using the computation
formula.
(b) Determine the least squares equation for these data.
(c) Determine the standard error of estimate.
(d) Predict the number of cars for each of two new families with two and five
drivers.
(e) Determine the square of the correlation coefficient, r².

10. Consider the following data

(a) Calculate a value for the correlation coefficient r, using the computation
formula.
(b) Determine the least squares equation for these data.
(c) Determine the standard error of estimate.
(d) Determine the square of the correlation coefficient.


Standard Normal Table (Table A)
