Data Science - Unit-4
UNIT-4
Describing Data II
Syllabus:
Describing Data II: Normal distributions – z scores – normal curve problems –
finding proportions – finding scores – more about z scores – correlation – scatter
plots – correlation coefficient for quantitative data – computational formula for
correlation coefficient – regression – regression line – least squares regression line –
standard error of estimate – interpretation of r² – multiple regression equations –
regression toward the mean.
The Normal Distributions
A normal distribution (or Gaussian distribution) is a continuous
probability distribution that is symmetrical about the mean, so that the
right side of the center is a mirror image of the left side.
The normal distribution is important because it accurately describes the
distribution of values for many natural phenomena.
Many observed frequency distributions approximate the well-documented
normal curve, an important theoretical curve noted for its symmetrical
bell-shaped form.
Characteristics that are the sum of many independent processes
frequently follow normal distributions. For example, heights, blood
pressure, measurement error, and IQ scores approximately follow the
normal distribution.
The normal curve is defined in terms of its mean and standard deviation.
The normal curve can be used to obtain answers to a wide variety of
questions.
The normal curve peaks above a point midway along the horizontal
spread and then tapers off gradually in either direction from the peak.
The curve approaches the x-axis but never touches it as it extends
farther away from the mean in either direction.
The values of the mean, median and mode, located at a point midway
along the horizontal spread, are the same for the normal curve.
The total area under the curve is equal to 1.
The normal distribution curve has only one peak (i.e., it is unimodal).
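The properties listed above can be checked numerically. The following is a minimal sketch using Python's standard `statistics.NormalDist`; the mean of 100 and standard deviation of 15 are illustrative assumptions, not values taken from these notes.

```python
from statistics import NormalDist

# Hypothetical normal curve (illustrative values only).
curve = NormalDist(mu=100, sigma=15)

# Mean, median, and mode coincide for a normal curve.
print(curve.mean, curve.median, curve.mode)

# Symmetry: exactly half of the total area lies below the mean.
area_below_mean = curve.cdf(curve.mean)
print(area_below_mean)


def area_within(k):
    """Proportion of area within k standard deviations of the mean."""
    lower = curve.mean - k * curve.stdev
    upper = curve.mean + k * curve.stdev
    return curve.cdf(upper) - curve.cdf(lower)


for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {area_within(k):.4f}")
```

The same proportions result for any other choice of mean and standard deviation, because the distances are expressed in standard deviation units.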
Different Normal Curves
When using the normal curve, two pieces of information are indispensable:
values for the mean and the standard deviation.
Various types of normal curves are produced by an arbitrary change in
the value of either the mean (μ) or the standard deviation (σ).
Every normal curve can be interpreted in exactly the same way once any
distance from the mean is expressed in standard deviation units.
Z Scores
A unit-free, standardized score that indicates how many standard
deviations a score is above or below the mean of its distribution is called
a z score.
To obtain a z score, express any original score, whether measured in
inches, milliseconds, dollars, IQ points, etc., as a deviation from its mean
(by subtracting its mean) and then split this deviation into standard
deviation units (by dividing by its standard deviation), that is,
$$ z = \frac{X - \mu}{\sigma} $$
where X is the original score and μ and σ are the mean and the standard
deviation, respectively, for the normal distribution of the original scores.
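As a minimal sketch of the conversion described above (subtract the mean, then divide by the standard deviation), the function below computes a z score. The function name and the height example are hypothetical and chosen only for illustration.

```python
def z_score(x, mean, sd):
    """Return how many standard deviations x lies above (+) or below (-) the mean."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / sd


# Hypothetical example: a height of 66 inches in a distribution
# with mean 69 inches and standard deviation 3 inches.
z = z_score(66, mean=69, sd=3)
print(z)  # -1.0, i.e. one standard deviation below the mean
```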
Prepared By: MD SHAKEEL AHMED, Associate Professor, Dept. Of IT, VVIT, Guntur Page 2
4-2 B. Tech IT Regulation: R19 Data Science: UNIT-4
Finding Scores
In this type of normal curve problem, the standard normal table (table A)
must be consulted to find the unknown score or scores associated with
some known proportion.
Essentially, this type of problem requires using table A by
entering proportions in columns B, C, B′, or C′ and finding the z scores listed
in columns A or A′.
Finding One Score
Step-by-step procedure:
1. Sketch a normal curve and, on the correct side of the mean, draw a
line representing the target score
2. Plan your solution according to the normal table.
3. Find z.
4. Convert z to the target score using the formula X = μ + (z)(σ).
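The steps above can be sketched in code. Note that this sketch replaces the table A lookup with `NormalDist().inv_cdf` from Python's standard library, which returns the z score that has a given proportion of area below it. The function name, the mean of 500, and the standard deviation of 100 are assumptions made for illustration.

```python
from statistics import NormalDist


def score_for_lower_proportion(p, mean, sd):
    """Return the score X that has proportion p of the area below it."""
    z = NormalDist().inv_cdf(p)  # step 3: find z (stands in for the table lookup)
    return mean + z * sd         # step 4: X = mean + (z)(sd)


# Hypothetical problem: which score separates the upper 20 percent
# from the lower 80 percent when the mean is 500 and the SD is 100?
x = score_for_lower_proportion(0.80, mean=500, sd=100)
print(f"target score: {x:.2f}")
```

For a score on the left side of the mean, pass a proportion below 0.50; the resulting z is negative and the target score falls below the mean.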
Answers:
Correlation
Two variables are related if pairs of scores show an orderliness that can
be depicted graphically with a scatter plot and numerically with a
correlation coefficient.
The data in the following table represent a very simple observational study
with two dependent variables.
Perfect Relationship
A dot cluster that equals (rather than merely approximates) a straight line
reflects a perfect relationship between two variables.
Linear Relationship
A relationship that can be described best with a straight line.
Curvilinear Relationship
A relationship that can be described best with a curved line.
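The computational formula for the correlation coefficient is
$$ r = \frac{SP_{xy}}{\sqrt{SS_x \, SS_y}} $$
where the two sum of squares terms in the denominator are defined as
$$ SS_x = \sum X^2 - \frac{(\sum X)^2}{n}, \qquad SS_y = \sum Y^2 - \frac{(\sum Y)^2}{n} $$
and the sum of the products term in the numerator, SPxy, is defined as
$$ SP_{xy} = \sum XY - \frac{(\sum X)(\sum Y)}{n} $$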
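A minimal sketch of the computation follows, assuming the usual computational forms SSx = ΣX² − (ΣX)²/n, SSy = ΣY² − (ΣY)²/n, and SPxy = ΣXY − (ΣX)(ΣY)/n, with r = SPxy / √(SSx · SSy). The function name and the five pairs of scores are hypothetical and are not the data of any problem in these notes.

```python
from math import sqrt


def correlation(xs, ys):
    """Pearson r from the computational sums of squares and products."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must contain the same number of scores")
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    ss_x = sum(x * x for x in xs) - sum_x ** 2 / n
    ss_y = sum(y * y for y in ys) - sum_y ** 2 / n
    sp_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    return sp_xy / sqrt(ss_x * ss_y)


# Hypothetical pairs of scores (illustrative only).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = correlation(xs, ys)
print(f"r = {r:.4f}")
```

For these pairs, SSx = 10, SSy = 6, and SPxy = 6, so r = 6 / √60, a fairly strong positive relationship.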
Problem: Couples who attend a clinic for first pregnancies are asked to
estimate (independently of each other) the ideal number of children. Given that
X and Y represent the estimates of females and males, respectively, the results
are as follows:
Regression
Regression is a statistical technique that relates a dependent variable to
one or more independent (explanatory) variables.
A regression model is able to show whether changes observed in the
dependent variable are associated with changes in one or more of the
explanatory variables.
Regression captures the correlation between variables observed in a data
set, and quantifies whether those correlations are statistically significant
or not.
Regression Line
A regression line is a line that best describes the behaviour of a set of
data. In other words, it is the line that best fits the trend of a given data set.
The purpose of the line is to describe the relationship of a dependent
variable (Y variable) with one or more independent variables (X
variables).
By using the equation obtained from the regression line an analyst can
forecast future behaviours of the dependent variable by inputting different
values for the independent ones.
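As an illustration of fitting a line and then forecasting from it, the sketch below assumes the standard least squares solution for one predictor: slope b = SPxy / SSx and intercept a = mean(Y) − b · mean(X), giving Y′ = bX + a. The function names and the data are hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (b, a) for the least squares line Y' = bX + a."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    sp_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sp_xy / ss_x          # slope
    a = mean_y - b * mean_x   # intercept
    return b, a


def predict(x, b, a):
    """Predicted Y' for a given X, obtained by substituting X into the line."""
    return b * x + a


# Hypothetical data (illustrative only).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b, a = least_squares_line(xs, ys)
print(f"Y' = {b:.2f}X + {a:.2f}")
print(f"predicted Y' at X = 6: {predict(6, b, a):.2f}")
```

Predictions for X values far outside the range of the observed data should be treated with caution, since the line only summarizes the trend within that range.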
Types of regression
The two basic types of regression are:
Simple linear regression: Simple linear regression uses one
independent variable to explain or predict the outcome of the
dependent variable Y.
Multiple linear regression: Multiple linear regression uses two or
more independent variables to predict the outcome.
Predictive Errors
Prediction error refers to the difference between the predicted values
made by some model and the actual values.
The regression equation can be used to predict the Y′ value for a given X
value by simply substituting the X value into the equation.
(a) Determine the least squares equation for predicting weekly reading
time from educational level.
(b) Faith’s education level is 15. What is her predicted reading time?
(c) Keegan’s educational level is 11. What is his predicted reading time?
Answer:
Although we can estimate the overall predictive error by dealing directly with
predictive errors, Y − Y′, it is more efficient to use the following
computation formula for the standard error of estimate:
$$ s_{y|x} = \sqrt{\frac{SS_y (1 - r^2)}{n - 2}} $$
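A minimal sketch of this computation, assuming the commonly used form of the standard error of estimate, √(SSy(1 − r²) / (n − 2)). The function name and the input values (SSy = 6, r = 0.7746, n = 5) are hypothetical.

```python
from math import sqrt


def standard_error_of_estimate(ss_y, r, n):
    """Rough average of the predictive errors Y - Y', in the units of Y."""
    if n <= 2:
        raise ValueError("at least three pairs of scores are required")
    return sqrt(ss_y * (1 - r ** 2) / (n - 2))


# Hypothetical values (illustrative only).
s = standard_error_of_estimate(ss_y=6, r=0.7746, n=5)
print(f"standard error of estimate: {s:.4f}")
```

When r is 1 or −1 the result is 0 (no predictive error); when r is 0 the predictive error is as large as the variability in Y allows.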
Interpretation of r²
The squared correlation coefficient, r², provides a measure of predictive
accuracy that supplements the standard error of estimate, sy|x.
r² indicates the proportion of total variability in one variable that is
predictable from its relationship with the other variable.
It is a statistical measure in a regression model that determines the
proportion of variance in the dependent variable that can be explained by
the independent variable. In other words, r-squared shows how well the
data fit the regression model (the goodness of fit).
r-squared can take any value between 0 and 1. Although the statistical
measure provides some useful insights regarding the regression model,
the user should not rely only on the measure in the assessment of a
statistical model.
In addition, it does not indicate the correctness of the regression model.
Therefore, the user should always draw conclusions about the model by
analyzing r-squared together with the other variables in a statistical
model.
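Expressing the equation for r² in symbols, we have:
$$ r^2 = \frac{SS_y - SS_{y|x}}{SS_y} $$
Example: Suppose
SSy = 80 and SSy|x = 28.8
then
$$ r^2 = \frac{80 - 28.8}{80} = \frac{51.2}{80} = 0.64 $$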
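The arithmetic of this example can be checked with a short sketch, assuming r² is the proportion of the total variability SSy that remains after removing the unpredictable part SSy|x, that is, (SSy − SSy|x) / SSy. The function name is hypothetical.

```python
from math import sqrt


def r_squared(ss_y, ss_y_given_x):
    """Proportion of variability in Y that is predictable from X."""
    return (ss_y - ss_y_given_x) / ss_y


r2 = r_squared(ss_y=80, ss_y_given_x=28.8)
print(round(r2, 2))        # 0.64
print(round(sqrt(r2), 2))  # 0.8, the size of r (its sign is not recoverable from r²)
```

Here 64 percent of the variability in Y is predictable from its relationship with X, and the remaining 36 percent is not.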
5. For the normal distribution of burning times of electric light bulbs, with a mean
equal to 1200 hours and a standard deviation equal to 120 hours, what burning
time is identified with the
(a) upper 50 percent?
(b) lower 75 percent?
(c) lower 1 percent?
(d) middle 90 percent?
6. Assume that each of the raw scores listed originates from a distribution with
the specified mean and standard deviation. After converting each raw score
into a z score, transform each z score into a series of new standard scores with
means and standard deviations of 50 and 10, 100 and 15, and 500 and 100,
respectively
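As a hint for this kind of transformation, the sketch below assumes the usual rule: new score = new mean + (z)(new standard deviation). The raw score of 24 with mean 20 and standard deviation 5 is a hypothetical example, not one of the scores listed for this problem.

```python
def to_standard_score(x, mean, sd, new_mean, new_sd):
    """Convert a raw score to z, then to a scale with the given mean and SD."""
    z = (x - mean) / sd
    return new_mean + z * new_sd


# Hypothetical raw score: 24, from a distribution with mean 20 and SD 5.
new_scores = [
    to_standard_score(24, mean=20, sd=5, new_mean=m, new_sd=s)
    for m, s in [(50, 10), (100, 15), (500, 100)]
]
for (m, s), score in zip([(50, 10), (100, 15), (500, 100)], new_scores):
    print(f"mean {m}, SD {s}: {score:.1f}")
```

The z score (0.8 in this example) is the same on every scale; only the units in which it is reported change.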
9. Each of the following pairs represents the number of licensed drivers (X ) and
the number of cars (Y ) for seven houses in my neighborhood: