Unit IV
4.2 Z SCORES
z = (X − μ) / σ
where X is the original score and μ and σ are the mean and the standard deviation, respectively.
A z score consists of two parts:
1. a positive or negative sign indicating whether it’s above or below the
mean; and
2. a number indicating the size of its deviation from the mean in
standard deviation units.
For the FBI applicants, replace X with 66 (the maximum permissible height), μ with 69 (the mean height), and σ with 3 (the standard deviation of heights), and solve for z as follows:
z = (66 − 69) / 3 = −3 / 3 = −1.00
4. Find the target area. Refer to the standard normal table, using
the bottom legend, as the z score is negative. The arrows in
Table 5.1 show how to read the table. Look up column A’ to
1.00 (representing a z score of –1.00), and note the
corresponding proportion of .1587 in column C′: this is the answer, as suggested in the right part of the figure. It can be concluded that only .1587 (or .16) of all FBI applicants will be shorter than 66 inches.
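The same result can be checked numerically. The following Python sketch (a minimal illustration, not part of the original notes) converts 66 inches to a z score and evaluates the standard normal cumulative proportion with the standard library's math.erf:

```python
import math

def normal_cdf(z: float) -> float:
    """Cumulative proportion of the standard normal curve below z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# FBI example: X = 66, mean = 69, standard deviation = 3
x, mu, sigma = 66, 69, 3
z = (x - mu) / sigma                  # z = -1.00
proportion_below = normal_cdf(z)      # about .1587

print(f"z = {z:.2f}, proportion shorter than 66 inches = {proportion_below:.4f}")
```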
3. Convert X to z by expressing the IQ scores of 135 and 75 as
z = (135 − 105) / 15 = 30 / 15 = 2.00
z = (75 − 105) / 15 = −30 / 15 = −2.00
3. Find z.
The entry in column C closest to .2000 is .2005, and the
corresponding z score in column A equals 0.84. Verify this by
checking Table A. Also note that exactly the same z score of 0.84 would have been identified if column B had been searched to find the proportion closest to .3000.
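The reverse lookup (from a tail proportion back to z) can also be sketched in Python. Here a simple bisection search inverts the same cumulative function, assuming the goal is the z whose upper-tail area (column C) equals .2000; nothing here is from the original notes:

```python
import math

def normal_cdf(z: float) -> float:
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def z_for_upper_tail(tail: float) -> float:
    """Find z such that the area beyond z (column C) equals `tail`, by bisection."""
    lo, hi = -6.0, 6.0
    for _ in range(60):                   # 60 halvings give ample precision
        mid = (lo + hi) / 2.0
        if 1.0 - normal_cdf(mid) > tail:  # tail area still too large, move right
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

print(round(z_for_upper_tail(0.2000), 2))   # about 0.84, matching the table lookup
```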
Standard Score
Whenever any unit-free scores are expressed relative to a
known mean and a known standard deviation, they are referred
to as standard scores. Although z scores qualify as standard
scores because they are unit-free and expressed relative to a
known mean of 0 and a known standard deviation of 1, other
scores also qualify as standard scores.
4.7 CORRELATION
Two variables are related if pairs of scores show an
orderliness that can be depicted graphically with a scatterplot
and numerically with a correlation coefficient.
AN INTUITIVE APPROACH
If the suspected relationship between cards sent and cards received does exist, then an inspection of the data might reveal, as one possibility, a tendency for “big senders” to be “big receivers” and for “small senders” to be “small receivers.”
Positive Relationship
When relatively low values are paired with relatively low values, and relatively high values are paired with relatively high values, the relationship is positive. A positive relationship implies “You get what you give.”
Negative Relationship
When relatively low values are paired with relatively high values, and relatively high values are paired with relatively low values, the relationship is negative. A negative relationship implies “You get the opposite of what you give.”
Little or No Relationship
It is also possible that little, if any, relationship exists between the two variables and that “What you get has no bearing on what you give.”
4.7 SCATTERPLOTS
A scatterplot is a graph containing a cluster of dots that
represents all pairs of scores. With a little training, you can use a scatterplot to judge whether, and how, two variables are related.
The first step is to note the tilt or slope, if any, of a dot cluster. A
dot cluster that has a slope from the lower left to the upper
right, as in panel A of Figure 6.2, reflects a positive
relationship. Small values of one variable are paired with small
values of the other variable, and large values are paired with
large values. In panel A, short people tend to be light, and tall
people tend to be heavy.
On the other hand, a dot cluster that has a slope from the upper left to the lower right, as in panel B of the figure, reflects a negative relationship. Small values of one variable tend to be paired with large values of the other variable, and vice versa. Finally, a dot cluster that lacks any apparent slope, as in panel C of Figure 6.2, reflects little or no relationship. Small values of one variable are just as likely to be paired with small, medium, or large values of the other variable.
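A rough Python/matplotlib sketch of the three dot-cluster patterns described above; the data are simulated purely for illustration and do not come from the notes:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(size=100)
noise = rng.normal(scale=0.5, size=100)

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
axes[0].scatter(x, x + noise)                 # slope lower left to upper right
axes[0].set_title("A: positive relationship")
axes[1].scatter(x, -x + noise)                # slope upper left to lower right
axes[1].set_title("B: negative relationship")
axes[2].scatter(x, rng.normal(size=100))      # no apparent slope
axes[2].set_title("C: little or no relationship")
for ax in axes:
    ax.set_xlabel("X")
    ax.set_ylabel("Y")
plt.tight_layout()
plt.show()
```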
Perfect Relationship
A dot cluster that equals (rather than merely approximates) a
straight line reflects a perfect relationship between two variables.
In practice, perfect relationships are most unlikely.
Curvilinear Relationship
The previous discussion assumes that a dot cluster approximates
a straight line and, therefore, reflects a linear relationship. But
this is not always the case. Sometimes a dot cluster approximates
a bent or curved line, as in Figure 6.4.
Look again at the scatterplot in Figure 6.1 for the greeting card data.
Although the small number of dots in Figure 6.1 hinders any
interpretation, the dot cluster appears to approximate a straight
line, stretching from the lower left to the upper right. This
suggests a positive relationship between greeting cards sent and
received, in agreement with the earlier intuitive analysis of these
data.
Key Properties of r
The Pearson correlation coefficient, r, can equal any value
between –1.00 and +1.00. Furthermore, the following two
properties apply:
1. The sign of r indicates the type of linear relationship, whether positive or
negative.
2. The numerical value of r, without regard to sign, indicates
the strength of the linear relationship.
Sign of r
A number with a plus sign (or no sign) indicates a positive
relationship, and a number with a minus sign indicates a
negative relationship.
Numerical Value of r
The more closely a value of r approaches either –1.00 or +1.00,
the stronger (more regular) the relationship. Conversely, the more
closely the value of r approaches 0, the weaker (less regular) the
relationship. For example, an r of –.90 indicates a stronger
relationship than does an r of –.70, and an r of –.70 indicates
a stronger relationship than does an r of .50. (Remember, if no sign appears, it is understood to be plus.) The value of r is a measure of how well a straight line (representing the linear relationship) describes the cluster of dots in the scatterplot.
Interpretation of r
Located along a scale from –1.00 to +1.00, the value of r supplies
information about the direction of a linear relationship—whether
positive or negative—and, generally, information about the
relative strength of a linear relationship—whether relatively weak
(and a poor describer of the data) because r is in the vicinity of 0,
or relatively strong (and a good describer of the data) because r
deviates from 0 in the direction of either +1.00 or –1.00.
The Pearson correlation coefficient is computed as
r = SPxy / √(SSx · SSy)
where the two sum of squares terms in the denominator are defined as
SSx = Σ(X − X̄)²  and  SSy = Σ(Y − Ȳ)²
and the sum of products term in the numerator is
SPxy = Σ(X − X̄)(Y − Ȳ)
Calculation of r
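As a sketch of this calculation, the Python function below computes r from the sums of squares and the sum of products. The card counts are made-up numbers chosen only for illustration:

```python
def pearson_r(x, y):
    """Pearson correlation: r = SPxy / sqrt(SSx * SSy)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    sp_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sp_xy / (ss_x * ss_y) ** 0.5

# Hypothetical greeting-card counts for five friends (sent, received)
cards_sent = [13, 9, 7, 5, 1]
cards_received = [14, 18, 12, 10, 6]
print(round(pearson_r(cards_sent, cards_received), 2))   # 0.8 for these made-up scores
```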
OUTLIERS
Outliers were defined as very extreme scores that require special
attention because of their potential impact on a summary of data.
This is also true when outliers appear among sets of paired
scores. Although quantitative techniques can be used to detect
these outliers, we simply focus on dots in scatterplots that
deviate conspicuously from the main dot cluster.
The sample variances and the sample covariance are defined as
s_x² = Σ(x − x̄)² / (n − 1),  s_y² = Σ(y − ȳ)² / (n − 1),  and
cov(x, y) = Σ(x − x̄)(y − ȳ) / (n − 1)
The variances of x and y measure the variability of the x scores
and y scores around their respective sample means of X and Y
considered separately. The covariance measures the variability of
the (x,y) pairs around the mean of x and mean of y, considered
simultaneously.
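Since r = cov(x, y) / (s_x · s_y), the correlation can also be computed from the covariance. A brief numpy check on a few made-up (x, y) pairs (not the full infant data set from the notes):

```python
import numpy as np

# A few illustrative (x, y) pairs, invented for this sketch
x = np.array([34.7, 36.0, 40.1, 35.7, 42.4])
y = np.array([1895.0, 2030.0, 2835.0, 3090.0, 3827.0])

cov_xy = np.cov(x, y, ddof=1)[0, 1]          # sample covariance of the (x, y) pairs
s_x, s_y = np.std(x, ddof=1), np.std(y, ddof=1)

r = cov_xy / (s_x * s_y)
print(round(r, 3))
print(round(np.corrcoef(x, y)[0, 1], 3))     # same value from numpy's built-in
```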
We first summarize the gestational age data. The mean gestational age is 38.4 weeks. Next, we summarize the birth weight data. The mean birth weight is 2,902 grams. The deviations shown in the tables below are taken from these means.
Infant ID#   Birth Weight (g)   Deviation from Mean   Squared Deviation
1            1895               −1007                 1,014,049
2            2030               −872                  760,384
3            1440               −1462                 2,137,444
4            2835               −67                   4,489
5            3090               188                   35,344
6            3827               925                   855,625
7            3260               358                   128,164
8            2690               −212                  44,944
9            3285               383                   146,689
10           2920               18                    324
11           3430               528                   278,784
12           3657               755                   570,025
13           3685               783                   613,089
14           3345               443                   196,249
15           3260               358                   128,164
16           2680               −222                  49,284
17           2005               −897                  804,609
Infant ID#   Gestational Age − Mean   Birth Weight − Mean   Product of Deviations
1            −3.7                     −1007                 3725.9
2            −2.4                     −872                  2092.8
3            −9.1                     −1462                 13304.2
4            1.7                      −67                   −113.9
5            −2.7                     188                   −507.6
6            4.0                      925                   3700.0
7            1.9                      358                   680.2
8            −1.1                     −212                  233.2
9            2.5                      383                   957.5
10           −0.1                     18                    −1.8
11           0.1                      528                   52.8
4.10 Regression
In simple linear regression, the dependent variable y is modeled as a straight-line function of the independent variable x:
y = mx + c + e
where m is the slope, c is the intercept, and e is the random error term.
3) Polynomial Regression
In a polynomial regression, the power of the
independent variable is more than 1. The equation below
represents a polynomial equation:
y = a + b·x²
In this regression technique, the best fit line is not a straight
line. It is rather a curve that fits into the data points.
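A minimal polynomial-fit sketch in Python, using numpy's polyfit on simulated points; the data and the choice of a second-degree fit are only for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 30)
y = 2.0 + 0.5 * x**2 + rng.normal(scale=0.3, size=x.size)   # curved, not straight-line, data

coeffs = np.polyfit(x, y, deg=2)      # least squares fit of y = c2*x^2 + c1*x + c0
y_hat = np.polyval(coeffs, x)         # fitted values along the curve

print("fitted coefficients (highest power first):", np.round(coeffs, 2))
```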
4) Logistic Regression
Logistic regression is a regression technique used when the dependent variable is discrete, for example 0 or 1, or true or false. This means the target variable can take only two values, and a sigmoid function describes the relation between the target variable and the independent variable.
The logistic function is used in Logistic Regression to create a
relation between the target variable and independent variables.
The equation below denotes the logistic regression model:
logit(p) = ln(p / (1 − p)) = b0 + b1·x1 + b2·x2 + … + bk·xk
where p is the probability of occurrence of the feature.
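To make the sigmoid relation concrete, the short sketch below plugs an assumed linear combination b0 + b1·x into the logistic function; the coefficients are invented for illustration and are not fitted to any real data:

```python
import numpy as np

def sigmoid(t):
    """Logistic function: maps any real value to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-t))

b0, b1 = -3.0, 1.5                 # assumed (made-up) intercept and slope
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])

p = sigmoid(b0 + b1 * x)           # probability that the target equals 1
predicted_class = (p >= 0.5).astype(int)

print(np.round(p, 3), predicted_class)
```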
5) Ridge Regression
Ridge regression is another type of regression in machine learning that is usually used when there is high correlation between the predictor variables. When predictors are highly correlated, the ordinary least squares estimates remain unbiased, but their variances become very large, so the estimated coefficients can lie far from their true values. Therefore, we introduce a bias matrix into the least squares equations of ridge regression. Trading a small amount of bias for a large reduction in variance makes it a powerful regression method whose models are less susceptible to overfitting.
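In the usual formulation, the "bias matrix" corresponds to adding λI to XᵀX in the normal equations. The numpy sketch below illustrates that closed form on simulated collinear data with an arbitrary λ; it is a minimal illustration, not the notes' own example:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)       # nearly collinear with x1
X = np.column_stack([np.ones(n), x1, x2])      # design matrix with an intercept column
y = 1.0 + 2.0 * x1 + rng.normal(scale=0.5, size=n)

lam = 1.0                                      # ridge penalty (lambda), chosen arbitrarily
I = np.eye(X.shape[1])
I[0, 0] = 0.0                                  # conventionally leave the intercept unpenalized

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)                 # ordinary least squares
beta_ridge = np.linalg.solve(X.T @ X + lam * I, X.T @ y)     # ridge: (X'X + lambda*I)^-1 X'y

print("OLS coefficients:  ", np.round(beta_ols, 2))
print("Ridge coefficients:", np.round(beta_ridge, 2))
```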
6) Lasso Regression
Lasso regression performs regularization along with feature selection. It penalizes the absolute size of the regression coefficients, which drives some coefficient values all the way to zero; this property differs from ridge regression, where coefficients shrink toward zero but rarely become exactly zero. As a result, lasso regression performs feature selection: only the required parameters keep nonzero coefficients, and the rest are set to zero, which helps avoid overfitting in the model. If the independent variables are highly collinear, however, lasso regression tends to choose only one of them and shrink the others to zero.
The equation below represents the objective minimized in lasso regression:
N^{-1} Σ^{N}_{i=1} (y_i − α − x_iᵀβ)² + λ Σ_j |β_j|
where α is the intercept, β is the vector of coefficients, and λ controls how strongly the absolute size of the coefficients is penalized.
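A brief scikit-learn sketch contrasting lasso and ridge on the same simulated collinear data, to show lasso driving one of the correlated coefficients to zero; the alpha values and data are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(3)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)       # highly collinear second predictor
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.5, size=n)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("lasso coefficients:", np.round(lasso.coef_, 2))   # one coefficient typically becomes 0
print("ridge coefficients:", np.round(ridge.coef_, 2))   # both stay nonzero but shrunken
```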
Placement of Line
For the time being, forget about any prediction for Emma and
concentrate on how the five dots dictate the placement of the
regression line. If all five dots had defined a single straight line,
placement of the regression line would have been simple; merely
let it pass through all dots. When the dots fail to define a single
straight line, as in the scatterplot for the five friends, placement
of the regression line represents a compromise. It passes through
the main cluster, possibly touching some dots but missing
others.
Predictive Errors
The least squares regression equation, which minimizes the total squared predictive error, is
Y′ = bX + a    (1)
where Y′ is the predicted value and X is the known value. The formula for b reads:
b = r √(SSy / SSx)    (2)
and the formula for a reads:
a = Ȳ − b X̄    (3)
The standard error of estimate, a rough measure of the average amount of predictive error, is
sy|x = √( Σ(Y − Y′)² / (n − 2) )    (4)
which can also be computed as
sy|x = √( SSy (1 − r²) / (n − 2) )    (5)
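A short Python check of formulas (1) to (3) on made-up (X, Y) pairs, comparing the hand formulas with numpy's own line fit; none of these numbers come from the notes:

```python
import numpy as np

# Made-up cards sent (X) and cards received (Y) for five friends
X = np.array([13.0, 9.0, 7.0, 5.0, 1.0])
Y = np.array([14.0, 18.0, 12.0, 10.0, 6.0])

ss_x = np.sum((X - X.mean()) ** 2)
ss_y = np.sum((Y - Y.mean()) ** 2)
r = np.corrcoef(X, Y)[0, 1]

b = r * np.sqrt(ss_y / ss_x)       # slope, formula (2)
a = Y.mean() - b * X.mean()        # intercept, formula (3)

slope, intercept = np.polyfit(X, Y, deg=1)   # numpy's least squares line for comparison
print(round(b, 2), round(a, 2))              # matches the values below
print(round(slope, 2), round(intercept, 2))
```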
For the sake of the present argument, pretend that we know the Y scores but not the corresponding X scores. Lacking information about the relationship between X and Y scores, statisticians recommend, under these circumstances, repetitive predictions of the mean, Ȳ, for a variety of reasons, including the fact that, although the predictive error for any individual might be quite large, the sum of all of the resulting five predictive errors (deviations of Y scores about Ȳ) always equals zero, as you may recall from Section 3.3. Most important for our purposes, using the repetitive prediction of Ȳ for each of the Y scores of all five friends will supply us with a frame of reference against which to evaluate our customary predictive effort based on the correlation between cards sent (X) and cards received (Y).
Predictive Errors
The error variability for the repetitive prediction of the mean can be designated as SSy, since each Y score is expressed as a squared deviation from Ȳ and then summed, that is,
SSy = Σ(Y − Ȳ)²
Applying this formula to the errors for the five friends shown in Panel A of Figure 4.16 gives the value of SSy.
The error variability for the customized predictions from the least
squares equation can be designated as SSy|x, since each Y score
is expressed as a squared deviation from its corresponding Y’ and
then summed, that is,
SSy|x = Σ(Y − Y′)²
Applying this formula to the errors for the five friends shown in Panel B of Figure 4.16 gives the value of SSy|x.
The squared correlation coefficient, r², equals the proportion of the total variability in Y that is explained by the regression equation:
r² = SSy′ / SSy    (4.16)
where the one new sum of squares term, SSy′, is simply the variability explained by or predictable from the regression equation, that is,
SSy′ = SSy − SSy|x
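This partition can be verified numerically. The sketch below uses made-up scores for five friends, computes SSy, SSy|x, and SSy′, and confirms that SSy′/SSy equals r²; it is only an illustration of the identity:

```python
import numpy as np

X = np.array([13.0, 9.0, 7.0, 5.0, 1.0])     # made-up cards sent
Y = np.array([14.0, 18.0, 12.0, 10.0, 6.0])  # made-up cards received

b, a = np.polyfit(X, Y, deg=1)               # least squares slope and intercept
Y_pred = b * X + a                           # Y' for each friend

ss_y = np.sum((Y - Y.mean()) ** 2)           # total variability (repetitive prediction of the mean)
ss_y_given_x = np.sum((Y - Y_pred) ** 2)     # error variability around the regression line
ss_y_prime = ss_y - ss_y_given_x             # variability explained by the regression

r = np.corrcoef(X, Y)[0, 1]
print(round(ss_y_prime / ss_y, 3), round(r ** 2, 3))   # the two values agree
```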
4.15 POPULATIONS
Any complete set of observations (or potential
observations) may be characterized as a population. Accurate
descriptions of populations specify the nature of the
observations to be taken. For example, a population might be
described as “attitudes toward abortion of currently enrolled
students at Bucknell University” or as “SAT critical reading
scores of currently enrolled students at Rutgers University.”
Real Populations
Pollsters, such as the Gallup Organization, deal with real
populations. A real population is one in which all potential
observations are accessible at the time of sampling. Examples of
real populations include the two described in the previous
paragraph, as well as the ages of all visitors to Disneyland on a
given day, the ethnic backgrounds of all current employees of
the U.S. Postal Service, and presidential preferences of all
currently registered voters in the United States. Incidentally,
federal law requires that a complete survey be taken every 10
years of the real population of all U.S. households—at
considerable expense, involving thousands of data collectors—as
a means of revising election districts for the House of
Representatives. (An estimated undercount of millions of people, particularly minorities, in both the 2000 and 2010 censuses has fueled a continuing debate about supplementing the traditional head count with estimates based on statistical sampling.)
Hypothetical Populations
In contrast, a hypothetical population is one in which all potential observations are not accessible at the time of sampling.
4.16 ANOVA:
One-Factor ANOVA
This section describes the simplest type of analysis of variance.
Often referred to as a one-factor (or one-way) ANOVA, it tests
whether differences exist among population means categorized
by only one factor or independent variable, such as hours of
sleep deprivation, with measures on different subjects. The
ANOVA techniques described in this chapter presume that all
scores are independent. In other words, each subject contributes
just one score to the overall analysis.
EXAMPLE:
Imagine a simple experiment with three groups, each
containing four observations. For each of the following outcomes,
indicate whether there is variability between groups and also
whether there is variability within groups.
ANSWER:
F TEST
The null hypothesis has been tested with a t ratio. In the two-
sample case, t reflects the ratio between the observed difference between
the two sample means in the numerator and the estimated standard
error in the denominator. For three or more samples, the null hypothesis
is tested with a new ratio, the F ratio. Essentially, F reflects the ratio of the observed variability between groups (in the numerator) to the variability within groups (in the denominator).
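A small numpy sketch of this F ratio for three made-up groups of four observations each (matching the layout of the example above, for instance scores under different amounts of sleep deprivation), with a cross-check against scipy's built-in one-way ANOVA:

```python
import numpy as np
from scipy import stats

# Three made-up groups of four observations each
groups = [np.array([7.0, 8.0, 6.0, 7.0]),
          np.array([5.0, 6.0, 5.0, 4.0]),
          np.array([3.0, 4.0, 2.0, 3.0])]

k = len(groups)
N = sum(len(g) for g in groups)
grand_mean = np.concatenate(groups).mean()

ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

ms_between = ss_between / (k - 1)          # variability between groups
ms_within = ss_within / (N - k)            # variability within groups
F = ms_between / ms_within

print(round(F, 2))
print(round(stats.f_oneway(*groups).statistic, 2))   # same F from scipy
```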