fundamentals of Data science unit 3
Normal distributions – z scores – normal curve problems – finding proportions – finding scores –
more about z scores – correlation – scatter plots – correlation coefficient for quantitative data –
computational formula for correlation coefficient
When using the normal curve, two bits of information are indispensable: values for the
mean and the standard deviation
Various types of normal curves are produced by an arbitrary change in the value of either
the mean (μ) or the standard deviation (σ)
Every normal curve can be interpreted in exactly the same way once any distance from
the mean is expressed in standard deviation units
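The point above can be checked numerically. The following is a minimal sketch, assuming Python's standard-library `statistics.NormalDist`; the three (mean, standard deviation) pairs and the helper name `proportion_within` are illustrative choices, not part of these notes. It shows that the proportion of area within one standard deviation of the mean is the same for every normal curve.

```python
from statistics import NormalDist

def proportion_within(mean, sd, k):
    """Proportion of a normal curve lying within k standard deviations of its mean."""
    curve = NormalDist(mu=mean, sigma=sd)
    return curve.cdf(mean + k * sd) - curve.cdf(mean - k * sd)

# Three hypothetical normal curves with different means and standard deviations.
curves = [(100, 15), (500, 100), (69, 3)]
within_one_sd = [proportion_within(m, s, 1) for m, s in curves]

for (m, s), p in zip(curves, within_one_sd):
    print(f"mean={m}, sd={s}: proportion within 1 sd = {p:.4f}")
```

Each curve gives the same proportion (about 0.68), which is why a single standard normal table can serve every normal distribution.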
Z Scores
A z score is a unit-free, standardized score that indicates how many standard deviations a
score is above or below the mean of its distribution.
To obtain a z score, express any original score, whether measured in inches,
milliseconds, dollars, IQ points, etc., as a deviation from its mean (by subtracting its
mean) and then split this deviation into standard deviation units (by dividing by its
standard deviation), that is,
z = (X − μ) / σ
where X is the original score and μ and σ are the mean and the standard deviation,
respectively, for the normal distribution of the original scores.
A z score consists of two parts:
1. A positive or negative sign indicating whether it’s above or below the mean;
and
2. A number indicating the size of its deviation from the mean in standard
deviation units.
Example: A z score of 2.00 always signifies that the original score is exactly two
standard deviations above its mean. Similarly, a z score of – 1.27 signifies that
the original score is exactly 1.27 standard deviations below its mean. A z score of
0 signifies that the original score coincides with the mean.
Problem: Express each of the following scores as a z score:
(a) Margaret’s IQ of 135, given a mean of 100 and a standard deviation of 15
(b) a score of 470 on the SAT math test, given a mean of 500 and a standard
deviation of 100
(c) a daily production of 2100 loaves of bread by a bakery, given a mean of
2180 and a standard deviation of 50
(d) Sam’s height of 69 inches, given a mean of 69 and a standard deviation of 3
(e) a thermometer-reading error of – 3 degrees, given a mean of 0 degrees and
a standard deviation of 2 degrees
Answers:
(a) z = (135-100)/15=2.33
(b) z = (470-500)/100= -0.30
(c) z = (2100-2180)/50= -1.60
(d) z = (69-69)/3=0.00
(e) z = (-3-0)/2=-1.50
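The five conversions can be reproduced with a few lines of code. This is a minimal sketch of the z score calculation (score minus mean, divided by standard deviation); the function name `z_score` and the dictionary layout are illustrative choices.

```python
def z_score(x, mean, sd):
    """Return how many standard deviations x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# (score, mean, standard deviation) for problems (a) through (e)
problems = {
    "a": (135, 100, 15),    # Margaret's IQ
    "b": (470, 500, 100),   # SAT math score
    "c": (2100, 2180, 50),  # daily loaves of bread
    "d": (69, 69, 3),       # Sam's height in inches
    "e": (-3, 0, 2),        # thermometer-reading error
}

answers = {label: round(z_score(*args), 2) for label, args in problems.items()}

for label, z in answers.items():
    print(f"({label}) z = {z:.2f}")
```

A score below its mean, as in (b), (c), and (e), always produces a negative z score.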
Finding score:
In this type of normal curve problem, the standard normal table (table A) must be consulted
to find the unknown score or scores associated with some known proportion.
Essentially, this type of problem requires using table A by entering proportions
in columns B, C, B′, or C′ and finding z scores listed in columns A or A′.
Step-by-step procedure:
1. Sketch a normal curve and, on the correct side of the mean, draw a line representing the target
score
Problem: Exam scores for a large psychology class approximate a normal curve with a
mean of 230 and a standard deviation of 50. Furthermore, students are graded “on a
curve,” with only the upper 20 percent being awarded grades of A. What is the lowest
score on the exam that receives an A?
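One way to work this problem without a printed table is sketched below. It assumes Python's standard-library `statistics.NormalDist`, whose `inv_cdf` method plays the role of looking up a z score from a known proportion; the variable names are illustrative. The upper 20 percent begins where 80 percent of the area lies below, so the z score is found for a cumulative proportion of 0.80 and then converted back to an exam score by multiplying by the standard deviation and adding the mean.

```python
from statistics import NormalDist

exam = NormalDist(mu=230, sigma=50)

# z score that separates the lower 80 percent from the upper 20 percent
z_cutoff = NormalDist().inv_cdf(0.80)

# Convert the z score back into an exam score: X = mean + (z)(standard deviation)
lowest_a = exam.mean + z_cutoff * exam.stdev

print(f"z = {z_cutoff:.2f}, lowest A score = {lowest_a:.1f}")
```

With a table that lists z to two decimal places, the closest entry is about z = 0.84, which gives a cutoff of roughly 272.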
Problem: Assume that the annual rainfall in the San Francisco area approximates a normal curve
with a mean of 22 inches and a standard deviation of 4 inches. What are the rainfalls for
the more atypical years, defined as the driest 2.5 percent of all years and the wettest 2.5
percent of all years?
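The rainfall problem has two target scores, one in each tail. The sketch below assumes the standard-library `statistics.NormalDist` as a stand-in for the table lookup; the variable names are illustrative.

```python
from statistics import NormalDist

rainfall = NormalDist(mu=22, sigma=4)

# Driest 2.5 percent of years: rainfall below which 2.5 percent of the area lies
driest = rainfall.inv_cdf(0.025)

# Wettest 2.5 percent of years: rainfall below which 97.5 percent of the area lies
wettest = rainfall.inv_cdf(0.975)

print(f"driest years: below {driest:.2f} inches")
print(f"wettest years: above {wettest:.2f} inches")
```

Both cutoffs lie about 1.96 standard deviations from the mean, one below and one above, so they are symmetric around 22 inches.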
More about Z Scores:
Z Scores for Non-normal Distributions:
z scores are not limited to normal distributions.
Non-normal distributions also can be transformed into sets of unit-free, standardized z
scores.
In this case, the standard normal table cannot be consulted, since the shape of the
distribution of z scores is the same as that for the original non-normal distribution.
Regardless of the shape of the distribution, the shift to z scores always produces a
distribution of standard scores with a mean of 0 and a standard deviation of 1.
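A quick check of this property is sketched below. The skewed data set is hypothetical, and the population standard deviation (`statistics.pstdev`) is assumed, matching the use of σ in these notes.

```python
from statistics import mean, pstdev

# Hypothetical, clearly non-normal (positively skewed) set of scores
scores = [1, 1, 2, 2, 3, 4, 10, 25]

mu = mean(scores)
sigma = pstdev(scores)  # population standard deviation

# Shift every score to a z score
z_scores = [(x - mu) / sigma for x in scores]

z_mean = mean(z_scores)
z_sd = pstdev(z_scores)

print(f"mean of z scores = {z_mean:.4f}")
print(f"standard deviation of z scores = {z_sd:.4f}")
```

The z scores keep the skewed shape of the original scores; only the mean and standard deviation change.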
Z scores can provide efficient descriptions of relative performance on one or more tests.
The use of z scores can help to identify a person’s relative strengths and weaknesses on
several different tests.
For example, the table above shows Sharon’s scores on college achievement tests in
three different subjects. The evaluation of her test performance is greatly
facilitated by converting her raw scores into the z scores listed in the final column
of the table. A glance at the z scores suggests that although she did relatively
well on the math test, her performance on the English test was only slightly above
average, as indicated by a z score of 0.50, and her performance on the psychology
test was slightly below average, as indicated by a z score of – 0.67.
Standard Score
Any unit-free score expressed relative to a known mean and a known standard
deviation is called a standard score.
Although z scores qualify as standard scores because they are unit-free and
expressed relative to a known mean of 0 and a known standard deviation of 1,
other scores also qualify as standard scores.
Transformed Standard Scores
z scores can be changed to transformed standard scores, other types of unit-free
standard scores that lack negative signs and decimal points.
These transformations change neither the shape of the original distribution nor the
relative standing of any test score within the distribution.
For example, a test score located one standard deviation below the mean might
be reported not as a z score of – 1.00 but as a T score of 40 in a distribution of T
scores with a mean of 50 and a standard deviation of 10.
The following figure shows the values of some of the more common types of
transformed standard scores relative to the various portions of the area under the
normal curve.
Converting to Transformed Standard Scores
The following formula can be used to convert any original standard score, z, into a
transformed standard score, z′, having a distribution with any desired mean and standard
deviation:
z′ = desired mean + (z)(desired standard deviation)
where z′ (called z prime) is the transformed standard score and z is the original
standard score.
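The conversion can be sketched as a one-line function. The function name `transform` is an illustrative choice. The example reuses the z score of – 1.00 from the T score illustration and the three scales (50 and 10, 100 and 15, 500 and 100) named in these notes.

```python
def transform(z, desired_mean, desired_sd):
    """Convert a z score into a transformed standard score z'."""
    return desired_mean + z * desired_sd

z = -1.00  # one standard deviation below the mean

t_score = transform(z, 50, 10)         # scale with mean 50, sd 10
scale_100_15 = transform(z, 100, 15)   # scale with mean 100, sd 15
scale_500_100 = transform(z, 500, 100) # scale with mean 500, sd 100

print(t_score, scale_100_15, scale_500_100)
```

The same z score lands one standard deviation below the mean on every scale, so the relative standing of the score is unchanged.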
Problem: Assume that each of the raw scores listed originates from a distribution with
the specified mean and standard deviation. After converting each raw score into a z score,
transform each z score into a series of new standard scores with means and standard
deviations of 50 and 10, 100 and 15, and 500 and 100, respectively.
Two variables are related if pairs of scores show orderliness that can be depicted
graphically with a scatter plot and numerically with a correlation coefficient.
The data in the following table represent a very simple observational study with two
dependent variables.
Positive Relationship
Two variables are positively related if pairs of scores tend to occupy similar relative
positions (relatively low values are paired with relatively low values, and relatively high
values are paired with relatively high values) in their respective distributions.
Example: (Height, Weight), (Temperature, Ice cream sales)
Negative Relationship
Two variables are negatively related if pairs of scores tend to occupy dissimilar relative
positions (relatively low values are paired with relatively high values, and relatively high
values are paired with relatively low values) in their respective distributions.
Example: (Exercise, Body Fat), (Watching Movies, Exam scores)
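The idea of "similar" versus "dissimilar" relative positions can be made concrete with z scores. The sketch below uses one common definition of the Pearson correlation coefficient, the average product of paired z scores (with population standard deviations). All data values are hypothetical and chosen only to illustrate the sign of the result.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Convert a list of scores to z scores using the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def correlation(xs, ys):
    """Pearson r computed as the average product of paired z scores."""
    products = [zx * zy for zx, zy in zip(z_scores(xs), z_scores(ys))]
    return mean(products)

# Hypothetical pairs: height (inches) and weight (pounds)
heights = [60, 62, 65, 68, 72]
weights = [110, 120, 140, 155, 180]

# Hypothetical pairs: weekly exercise (hours) and body fat (percent)
exercise = [0, 1, 2, 3, 4]
body_fat = [30, 27, 25, 20, 18]

r_positive = correlation(heights, weights)
r_negative = correlation(exercise, body_fat)

print(f"height and weight: r = {r_positive:.2f}")
print(f"exercise and body fat: r = {r_negative:.2f}")
```

When paired z scores tend to share the same sign, their products are mostly positive and r is positive. When they tend to have opposite signs, r is negative.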
Little or No Relationship
Scatter Plots
A scatter plot is a graph containing a cluster of dots that represents all pairs of
scores.
We can use any dot cluster as a preview of a fully measured relationship.
Construction
To construct a scatter plot, scale each of the two variables along the horizontal
(X) and vertical (Y) axes, and use each pair of scores to locate a dot within the
scatter plot.
Categorizing relationship using scatter plot
A dot cluster that has a slope from the lower left to the upper right reflects a
positive relationship. Small values of one variable are paired with small values of
the other variable, and large values are paired with large values.
Example: In panel A of the figure below, short people tend to be light, and tall
people tend to be heavy.
A dot cluster that has a slope from the upper left to the lower right reflects a
negative relationship. Small values of one variable tend to be paired with large
values of the other variable, and vice versa.
Example: In panel B of the figure below, people who have smoked heavily for a few
years or not at all tend to have longer lives, and people who have smoked heavily
for many years tend to have shorter lives.
A dot cluster that lacks any apparent slope reflects little or no relationship.
Small values of one variable are just as likely to be paired with small, medium,
or large values of the other variable.
Example: In panel C of the figure below, notice that the dots are strewn about in
an irregular shotgun fashion, suggesting that there is little or no relationship
between the height of young adults and their life expectancies.
Perfect Relationship
A dot cluster that equals (rather than merely approximates) a straight line reflects a perfect
relationship between two variables.
Linear Relationship
Curvilinear Relationship