fundamentals of Data science unit 3

This document covers normal distributions, including properties of the normal curve, z scores, and correlation analysis. It explains how to find proportions and scores using the standard normal table and discusses the significance of z scores in both normal and non-normal distributions. Additionally, it introduces scatter plots and correlation coefficients to analyze relationships between variables.

Uploaded by

kaleeswaranmmcas
Copyright
© © All Rights Reserved

UNIT-3

Normal distributions – z scores – normal curve problems – finding proportions – finding scores –
more about z scores – correlation – scatter plots – correlation coefficient for quantitative data –
computational formula for correlation coefficient

The Normal Distributions

 A normal distribution (or Gaussian distribution) is a continuous probability distribution
that is symmetrical on both sides of the mean, so that the right side of the center is a mirror
image of the left side.
 The normal distribution is important because it accurately describes the distribution of
values for many natural phenomena.
 Many observed frequency distributions approximate the well-documented normal curve,
an important theoretical curve noted for its symmetrical bell-shaped form.
 Characteristics that are the sum of many independent processes frequently follow normal
distributions. For example, heights, blood pressure, measurement error, and IQ scores
follow the normal distribution.
 The normal curve is defined in terms of standard deviation and mean.
 The normal curve can be used to obtain answers to a wide variety of questions.

Properties of the Normal Curve:


Important properties of the normal curve are:
 The normal curve is a theoretical curve defined for a continuous variable.
 The normal curve is symmetrical; its lower half is the mirror image of its upper
half.
 It is bell-shaped.
 The normal curve peaks above a point midway along the horizontal spread and
then tapers off gradually in either direction from the peak.
 The curve approaches the x-axis as it extends farther away from the mean, but it
never touches the axis.
 The values of the mean, median and mode, located at a point midway along the
horizontal spread, are the same for the normal curve.
 The total area under the curve equals 1.
 The normal distribution curve must have only one peak. (i.e., unimodal)
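
Several of these properties can be checked numerically. The following is a minimal sketch using Python's standard-library `statistics.NormalDist`; the mean of 69 and standard deviation of 3 are assumed values chosen only for illustration.

```python
from statistics import NormalDist

# Hypothetical curve: mean 69 and standard deviation 3 (illustrative values).
curve = NormalDist(mu=69, sigma=3)

# Symmetry: points equally far below and above the mean have the same height.
left_height = curve.pdf(69 - 4)
right_height = curve.pdf(69 + 4)

# Single peak at the mean: the curve is lower on either side of it.
peak_is_highest = curve.pdf(69) > curve.pdf(68) and curve.pdf(69) > curve.pdf(70)

# Total area: the area within ten standard deviations is essentially 1.
area = curve.cdf(69 + 30) - curve.cdf(69 - 30)

print(left_height, right_height, peak_is_highest, round(area, 6))
```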

Different Normal Curves

 When using the normal curve, two bits of information are indispensable: values for the
mean and the standard deviation
 Various types of normal curves are produced by an arbitrary change in the value of either
the mean (μ) or the standard deviation (σ)
 Every normal curve can be interpreted in exactly the same way once any distance from
the mean is expressed in standard deviation units
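
As a rough illustration of the last point, the sketch below compares three normal curves with different means and standard deviations. The parameter values and the helper name `proportion_within` are assumptions made for this example, not part of the notes.

```python
from statistics import NormalDist

def proportion_within(curve, k):
    """Proportion of area within k standard deviations of the curve's mean."""
    lower = curve.mean - k * curve.stdev
    upper = curve.mean + k * curve.stdev
    return curve.cdf(upper) - curve.cdf(lower)

# Hypothetical curves with different means and standard deviations.
iq = NormalDist(mu=100, sigma=15)
heights = NormalDist(mu=69, sigma=3)
standard = NormalDist(mu=0, sigma=1)

# Each curve gives the same proportion once distance is in standard deviation units.
for curve in (iq, heights, standard):
    print(round(proportion_within(curve, 1), 4))
```

Each curve reports the same proportion (about .68) within one standard deviation of its mean, which is why a single table for the standard normal curve is enough.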

Z Scores
 A unit-free, standardized score that indicates how many standard deviations a score is
above or below the mean of its distribution is called a z score.
 To obtain a z score, express any original score, whether measured in inches,
milliseconds, dollars, IQ points, etc., as a deviation from its mean (by subtracting its
mean) and then split this deviation into standard deviation units (by dividing by its
standard deviation), that is,

$$ z = \frac{X - \mu}{\sigma} $$

where X is the original score and μ and σ are the mean and the standard deviation,
respectively, for the normal distribution of the original scores.
 A z score consists of two parts:
1. A positive or negative sign indicating whether it’s above or below the mean;
and
2. A number indicating the size of its deviation from the mean in standard
deviation units.
 Example: A z score of 2.00 always signifies that the original score is exactly two
standard deviations above its mean. Similarly, a z score of – 1.27 signifies that
the original score is exactly 1.27 standard deviations below its mean. A z score of
0 signifies that the original score coincides with the mean.
 Problem: Express each of the following scores as a z score:
(a) Margaret’s IQ of 135, given a mean of 100 and a standard deviation of 15
(b) a score of 470 on the SAT math test, given a mean of 500 and a standard
deviation of 100
(c) a daily production of 2100 loaves of bread by a bakery, given a mean of
2180 and a standard deviation of 50
(d) Sam’s height of 69 inches, given a mean of 69 and a standard deviation of
3
(e) a thermometer-reading error of –3 degrees, given a mean of 0 degrees and
a standard deviation of 2 degrees
Answers:
(a) z = (135-100)/15 = 2.33
(b) z = (470-500)/100 = -0.30
(c) z = (2100-2180)/50 = -1.60
(d) z = (69-69)/3 = 0.00
(e) z = (-3-0)/2 = -1.50
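
The same calculations can be sketched in a few lines of Python. The function name `z_score` and the dictionary layout are assumptions made for this example; the numbers come from the problem above.

```python
def z_score(x, mean, sd):
    """Return how many standard deviations x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# (original score, mean, standard deviation) for parts (a) to (e).
problems = {
    "a": (135, 100, 15),
    "b": (470, 500, 100),
    "c": (2100, 2180, 50),
    "d": (69, 69, 3),
    "e": (-3, 0, 2),
}

for label, (x, mean, sd) in problems.items():
    print(f"({label}) z = {z_score(x, mean, sd):.2f}")
```

Note that the z score for (b) is negative, because a score of 470 lies below the mean of 500.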

STANDARD NORMAL CURVE:


 If the original distribution approximates a normal curve, then the shift to standard or z
scores will always produce a new distribution that approximates the standard normal
curve.
 This is the one normal curve for which a table is actually available.
 The standard normal curve always has a mean of 0 and a standard deviation of 1.
 Although there is an infinite number of different normal curves, each with its own mean
and standard deviation, there is only one standard normal curve, with a mean of 0 and a
standard deviation of 1.
 Converting all original observations into z scores leaves the normal shape intact but not
the units of measurement.
Standard Normal Table (z score):
 The standard normal table consists of columns of z scores coordinated with
columns of proportions.
 In a typical problem, access to the table is gained through a z score, such as
–1.00, and the answer is read as a proportion.
 Table columns are arranged in sets of three, designated as A, B, and C in the
legend at the top of the table. When using the top legend, all entries refer to the
upper half of the standard normal curve.
 The entries in column A are z scores, beginning with 0.00 and ending with 4.00.
Given a z score of zero or more, column B indicates the proportion of area between the
mean and the z score, and column C indicates the proportion of area beyond the z
score, in the upper tail of the standard normal curve.
 Because of the symmetry of the normal curve, the entries in table also can refer
to the lower half of the normal curve. Now the columns are designated as A ′, B′,
and C′ in the legend at the bottom of the table. When using the bottom legend, all
entries refer to the lower half of the standard normal curve.
 The nonzero entries in column A′ are negative z scores, beginning with –0.01 and
ending with –4.00.
 Column B′ indicates the proportion of area between the mean and the negative z
score, and column C′ indicates the proportion of area beyond the negative z score,
in the lower tail of the standard normal curve.
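
When the printed table is not at hand, its entries can be approximated in code. This is a minimal sketch, assuming Python's standard-library `statistics.NormalDist`; the function name `table_row` is hypothetical.

```python
from statistics import NormalDist

STANDARD = NormalDist(mu=0, sigma=1)  # the standard normal curve

def table_row(z):
    """Return (column B, column C) for a z score, rounded to four places.

    Column B is the area between the mean and z; column C is the area
    beyond z in the tail. By symmetry, a negative z gives the same
    values (columns B' and C').
    """
    z = abs(z)
    between = STANDARD.cdf(z) - 0.5
    beyond = 1 - STANDARD.cdf(z)
    return round(between, 4), round(beyond, 4)

print(table_row(1.00))   # (0.3413, 0.1587)
print(table_row(-1.00))  # same proportions, read from the lower half
```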
Normal Curve Problems
 There are two general types of normal curve problems:
(1) Finding proportions: these problems require finding the unknown proportion (of
area) associated with some score or pair of scores, and
(2) Finding scores: these problems require finding the unknown score or scores
associated with some area.
 Answers to the first type of problem usually require converting original scores into z
scores and answers to the second type of problem usually require translating a z score
back into an original score.
 Rough graphs of normal curves can be used as an aid to visualizing the solution. Only after
thinking through to a solution, do any calculations and consult the normal tables.
 When using the standard normal table, it is important to remember that
 For any z score, the corresponding proportions in columns B and C (or
columns B′ and C′) always sum to .5000.
 Similarly, the total area under the normal curve always equals 1.0000,
the sum of the proportions in the lower and upper halves, that is, .5000 +
.5000.
 Finally, although a z score can be either positive or negative, the
proportions of area under the curve are always positive or zero but never
negative.
Finding Proportions:
 In these normal curve problems, the standard normal table (table A) must be consulted to
find the unknown proportion (of area) associated with some known score or pair of
known scores.
Finding Proportions for One Score
 Step-by-step procedure:
1. Sketch a normal curve and shade in the target area
2. Plan solution according to the normal table.
3. Convert X to z using the formula

$$ z = \frac{X - \mu}{\sigma} $$

4. Find the target area.


Example: Find the proportion of all persons who are shorter than exactly 66 inches,
given that the distribution of heights approximates a normal curve with a mean of 69
inches and a standard deviation of 3 inches.
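
The height example can be worked through as follows. This is a sketch that uses `statistics.NormalDist` in place of the printed table; the variable names are chosen for illustration.

```python
from statistics import NormalDist

mean, sd = 69, 3   # heights in inches, from the example
x = 66             # target score

# Step 3: convert X to z.
z = (x - mean) / sd

# Step 4: the target area is everything below z (column C' in the table).
proportion_shorter = NormalDist().cdf(z)

print(f"z = {z:.2f}, proportion = {proportion_shorter:.4f}")
# z = -1.00, proportion = 0.1587
```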
Finding Proportions between Two Scores
 Step-by-step procedure:
1. Sketch a normal curve and shade in the target area
2. Plan solution according to the normal table.
3. Convert X to z using the formula

$$ z = \frac{X - \mu}{\sigma} $$

4. Find the target area.


 Example: Assume that, when not interrupted artificially, the gestation periods for human
foetuses approximate a normal curve with a mean of 270 days (9 months) and a standard
deviation of 15 days. What proportion of gestation periods will be between 245 and 255
days?
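
One possible way to check the gestation example numerically is sketched below. It computes the answer twice: once with z scores rounded to two decimals, as a table lookup would require, and once with unrounded z scores. The variable names are assumptions made for this example.

```python
from statistics import NormalDist

mean, sd = 270, 15
standard = NormalDist()  # standard normal curve: mean 0, standard deviation 1

z_low = (245 - mean) / sd    # about -1.67
z_high = (255 - mean) / sd   # -1.00

# Table-style answer: round each z score to two decimals before looking it up.
table_style = standard.cdf(round(z_high, 2)) - standard.cdf(round(z_low, 2))

# Exact answer: use the unrounded z scores.
exact = standard.cdf(z_high) - standard.cdf(z_low)

print(f"table-style: {table_style:.4f}, exact: {exact:.4f}")
```

The table-style result corresponds to subtracting the smaller lower-tail area (beyond z = –1.67) from the larger one (beyond z = –1.00), which gives about .1112; the unrounded result is slightly smaller.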

Finding Proportions beyond Two Scores


 Step-by-step procedure:
1. Sketch a normal curve and shade in the two target areas
2. Plan your solution according to the normal table.
3. Convert X to z using the formula

$$ z = \frac{X - \mu}{\sigma} $$

4. Find the target area
 Problem:
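Assume that high school students’ IQ scores approximate a normal distribution with a
mean of 105 and a standard deviation of 15. What proportion of IQs are more than 30
points either above or below the mean?
Answer: Expressing IQ scores of 135 and 75 as z scores gives z = (135 – 105)/15 = 2.00
and z = (75 – 105)/15 = –2.00. From the standard normal table, the proportion beyond each
z score (column C or C′) is .0228, so the proportion of IQs more than 30 points from the
mean is .0228 + .0228 = .0456.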
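
The IQ problem above can also be checked with a short sketch. It assumes `statistics.NormalDist` as a stand-in for the table, so the result differs from a table-based answer only by rounding.

```python
from statistics import NormalDist

mean, sd = 105, 15
standard = NormalDist()

z_upper = (135 - mean) / sd   # 2.00
z_lower = (75 - mean) / sd    # -2.00

upper_tail = 1 - standard.cdf(z_upper)   # area beyond z = 2.00
lower_tail = standard.cdf(z_lower)       # area beyond z = -2.00

total = upper_tail + lower_tail
print(f"each tail: {upper_tail:.4f}, both tails: {total:.4f}")
```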

Finding Scores:

 In this type of normal curve problem, the standard normal table (table A) must be consulted
to find the unknown score or scores associated with some known proportion.
 Essentially, this type of problem requires using table A by entering proportions
in columns B, C, B′, or C′ and finding z scores listed in columns A or A′.

Finding One Score:

Step-by-step procedure:

1. Sketch a normal curve and, on the correct side of the mean, draw a line representing the target
score

2. Plan your solution according to the normal table.


3. Find z.

4. Convert z to the target score using the formula

$$ X = \mu + (z)(\sigma) $$

 Problem: Exam scores for a large psychology class approximate a normal curve with a
mean of 230 and a standard deviation of 50. Furthermore, students are graded “on a
curve,” with only the upper 20 percent being awarded grades of A. What is the lowest
score on the exam that receives an A?
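
A minimal sketch of this problem follows, assuming `statistics.NormalDist.inv_cdf` as a replacement for searching the table. The z score is rounded to two decimals to mimic a table lookup.

```python
from statistics import NormalDist

mean, sd = 230, 50
upper_proportion = 0.20   # only the upper 20 percent receive an A

# Step 3: find the z score that separates the upper 20 percent from the rest.
z = round(NormalDist().inv_cdf(1 - upper_proportion), 2)   # 0.84

# Step 4: convert z back to an exam score, X = mean + (z)(sd).
lowest_a = mean + z * sd

print(f"z = {z:.2f}, lowest A score = {lowest_a:.1f}")
```

With these values the cutoff works out to an exam score of about 272.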

Finding Two Scores


Step-by-step procedure:
1. Sketch a normal curve. On either side of the mean, draw two lines representing the two target
scores
2. Plan your solution according to the normal table.
3. Find z.
4. Convert z to the target score using the formula

$$ X = \mu + (z)(\sigma) $$

Problem: Assume that the annual rainfall in the San Francisco area approximates a normal curve
with a mean of 22 inches and a standard deviation of 4 inches. What are the rainfalls for
the more atypical years, defined as the driest 2.5 percent of all years and the wettest 2.5
percent of all years?
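
The rainfall problem can be sketched the same way. As before, `inv_cdf` stands in for the table lookup and the z scores are rounded to two decimals; the variable names are chosen for this example.

```python
from statistics import NormalDist

mean, sd = 22, 4          # annual rainfall in inches
standard = NormalDist()

# Step 3: z scores that cut off the lowest and highest 2.5 percent.
z_dry = round(standard.inv_cdf(0.025), 2)   # -1.96
z_wet = round(standard.inv_cdf(0.975), 2)   # 1.96

# Step 4: convert each z score to rainfall, X = mean + (z)(sd).
driest = mean + z_dry * sd
wettest = mean + z_wet * sd

print(f"driest: {driest:.2f} inches, wettest: {wettest:.2f} inches")
```

Under these assumptions, the atypical years are those with about 14.16 inches or less and about 29.84 inches or more.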
More about Z Scores:
Z Scores for Non-normal Distributions:
 z scores are not limited to normal distributions.
 Non-normal distributions also can be transformed into sets of unit-free, standardized z
scores.
 In this case, the standard normal table cannot be consulted, since the shape of the
distribution of z scores is the same as that for the original non-normal distribution.
 Regardless of the shape of the distribution, the shift to z scores always produces a
distribution of standard scores with a mean of 0 and a standard deviation of 1.
 Z scores can provide efficient descriptions of relative performance on one or more tests.
 The use of z scores can help to identify a person’s relative strengths and weaknesses on
several different tests.

 For example, the table above shows Sharon’s scores on college achievement tests in
three different subjects. The evaluation of her test performance is greatly
facilitated by converting her raw scores into the z scores listed in the final column
of the table. A glance at the z scores suggests that although she did relatively
well on the math test, her performance on the English test was only slightly above
average, as indicated by a z score of 0.50, and her performance on the psychology
test was slightly below average, as indicated by a z score of –0.67.
Standard Score
 Any unit-free score expressed relative to a known mean and a known standard
deviation is called a standard score.
 Although z scores qualify as standard scores because they are unit-free and
expressed relative to a known mean of 0 and a known standard deviation of 1,
other scores also qualify as standard scores.
Transformed Standard Scores
 z scores can be changed to transformed standard scores, other types of unit-free
standard scores that lack negative signs and decimal points.
 These transformations change neither the shape of the original distribution nor the
relative standing of any test score within the distribution.
 For example, a test score located one standard deviation below the mean might
be reported not as a z score of – 1.00 but as a T score of 40 in a distribution of T
scores with a mean of 50 and a standard deviation of 10.
 The following figure shows the values of some of the more common types of
transformed standard scores relative to the various portions of the area under the
normal curve.
Converting to Transformed Standard Scores
 The following formula can be used to convert any original standard score, z, into a
transformed standard score, z′, having a distribution with any desired mean and standard
deviation:

$$ z' = \text{desired mean} + (z)(\text{desired standard deviation}) $$

where z′ (called z prime) is the transformed standard score and z is the original standard score.
 Problem: Assume that each of the raw scores listed originates from a distribution with
the specified mean and standard deviation. After converting each raw score into a z score,
transform each z score into a series of new standard scores with means and standard
deviations of 50 and 10, 100 and 15, and 500 and 100, respectively.
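
The conversion can be sketched as a one-line function. The function name `transform` is an assumption; the example z score of –1.00 and the three target scales come from the text above.

```python
def transform(z, desired_mean, desired_sd):
    """Convert a z score into a transformed standard score z'."""
    return desired_mean + z * desired_sd

z = -1.00  # one standard deviation below the mean

print(transform(z, 50, 10))     # 40.0  (T score scale)
print(transform(z, 100, 15))    # 85.0
print(transform(z, 500, 100))   # 400.0
```

The relative standing is unchanged in every case: each transformed score still lies one standard deviation below the mean of its own scale.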
Correlation
 Two variables are related if pairs of scores show orderliness that can be depicted
graphically with a scatter plot and numerically with a correlation coefficient.
 The data in the following table represent a very simple observational study with two
dependent variables.

Three Types of Relationships (Types of correlation)


 Positive Relationship
 Negative Relationship
 Little or No Relationship

Positive Relationship
 Two variables are positively related if pairs of scores tend to occupy similar relative
positions (relatively low values are paired with relatively low values, and relatively high
values are paired with relatively high values) in their respective distributions.
 Example: (Height, Weight), (Temperature, Ice cream sales)

Negative Relationship

 Two variables are negatively related if pairs of scores tend to occupy dissimilar relative
positions (relatively low values are paired with relatively high values, and relatively high
values are paired with relatively low values) in their respective distributions.
 Example: (Exercise, Body Fat), (Watching Movies, Exam scores)

Little or No Relationship

 No regularity is apparent among the pairs of scores.
 Example: (Shoe Size, Movies Watched), (Coffee Consumption, Intelligence)
Describing relationship between pairs of variables
 There are two more efficient and exact statistical techniques for describing the
relationship between two variables, namely, a special graph known as a scatter
plot and a measure known as a correlation coefficient.

Scatter Plots

 A scatter plot is a graph containing a cluster of dots that represents all pairs of
scores.
 We can use any dot cluster as a preview of a fully measured relationship.

Construction
 To construct a scatter plot, scale each of the two variables along the horizontal
(X) and vertical (Y) axes, and use each pair of scores to locate a dot within the
scatter plot.
Categorizing relationship using scatter plot

(Positive, Negative, or Little or No Relationship?)

 A dot cluster that has a slope from the lower left to the upper right reflects a
positive relationship. Small values of one variable are paired with small values of
the other variable, and large values are paired with large values.
 Example: In panel A of the figure below, short people tend to be light, and tall
people tend to be heavy.
 A dot cluster that has a slope from the upper left to the lower right reflects a
negative relationship. Small values of one variable tend to be paired with large
values of the other variable, and vice versa.
 Example: In panel B of the figure below, people who have smoked heavily for a few
years or not at all tend to have longer lives, and people who have smoked heavily
for many years tend to have shorter lives.
 A dot cluster that lacks any apparent slope reflects little or no relationship.
Small values of one variable are just as likely to be paired with small, medium,
or large values of the other variable.
 Example: In panel C of the figure below, notice that the dots are strewn about
in an irregular shotgun fashion, suggesting that there is little or no relationship
between the height of young adults and their life expectancies.
Perfect Relationship

A dot cluster that equals (rather than merely approximates) a straight line reflects a perfect
relationship between two variables.

Linear Relationship

A relationship that can be described best with a straight line.

Curvilinear Relationship

A relationship that can be described best with a curved line

A Correlation Coefficient For Quantitative Data : r


 A correlation coefficient is a number between –1 and +1 that describes the relationship
between pairs of variables.
 The type of correlation coefficient that describes the linear relationship
between pairs of variables for quantitative data is called the Pearson correlation
coefficient, designated as r. It can equal any value between –1.00 and +1.00.
 Furthermore, the following two properties apply:
1. The sign of r indicates the type of linear relationship, whether positive or negative
2. The numerical value of r, without regard to sign, indicates the strength of the
linear relationship.
 A number with a plus sign (or no sign) indicates a positive relationship, and a number
with a minus sign indicates a negative relationship. For example, an r with a plus sign
describes the positive relationship between height and weight, and an r with a minus sign
describes the negative relationship between heavy smoking and life expectancy.
 The more closely a value of r approaches either – 1.00 or +1.00, the stronger (more
regular) the relationship. Conversely, the more closely the value of r approaches 0, the
weaker (less regular) the relationship.
 For example, an r of –.90 indicates a stronger relationship than does an r of –.70, and an
r of –.70 indicates a stronger relationship than does an r of .50.
 A correlation coefficient, regardless of size, never provides information about whether
an observed relationship reflects a simple cause-effect relationship or some more
complex state of affairs.
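
To make the sign and strength properties concrete, the sketch below computes r from sums of squares and products of deviations, a standard way of writing the Pearson coefficient that these notes do not spell out. The function name `pearson_r` and all data values are hypothetical and chosen only for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson r as SP_xy / sqrt(SS_x * SS_y), using deviations from the means."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long lists with at least two pairs")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sp_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return sp_xy / sqrt(ss_x * ss_y)

# Hypothetical paired scores (illustrative values only).
x = [13, 9, 7, 5, 1]
y = [14, 18, 12, 10, 6]

print(round(pearson_r(x, y), 2))                 # 0.8: fairly strong positive relationship
print(round(pearson_r(x, [2 * v for v in x]), 2))  # 1.0: perfect positive relationship
print(round(pearson_r(x, [-v for v in x]), 2))     # -1.0: perfect negative relationship
```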
