Discovering Knowledge in Data (Larose, 2014), Chapter 2

Figure 2.5 Histogram of vehicle weights: can you find the outlier?

There appears to be one lonely vehicle in the extreme left tail of the distribution,
with a vehicle weight in the hundreds of pounds rather than in the thousands. Further
investigation (not shown) tells us that the minimum weight is 192.5 pounds, which
is undoubtedly our little outlier in the lower tail. As 192.5 pounds is rather light for
an automobile, we would tend to doubt the validity of this information.
We can surmise that perhaps the weight was originally 1925 pounds, with the decimal point misplaced somewhere along the line. We cannot be certain, however, and
further investigation into the data sources is called for.
Sometimes two-dimensional scatter plots3 can help to reveal outliers in more than one variable. Figure 2.6, a scatter plot of mpg against weightlbs, seems to have netted two outliers.

Figure 2.6 Scatter plot of mpg against weightlbs shows two outliers.
Most of the data points cluster together along the horizontal axis, except for two
outliers. The one on the left is the same vehicle we identified in Figure 2.5, weighing
only 192.5 pounds. The outlier near the top is something new: a car that gets over
500 miles per gallon! Clearly, unless this vehicle runs on dilithium crystals, we are
looking at a data entry error.
Note that the 192.5 pound vehicle is an outlier with respect to weight but
not with respect to mileage. Similarly, the 500-mpg car is an outlier with respect
to mileage but not with respect to weight. Thus, a record may be an outlier in a
particular dimension but not in another. We shall examine numeric methods for
identifying outliers, but we need to pick up a few tools first.

2.6 MEASURES OF CENTER AND SPREAD

Suppose that we are interested in estimating where the center of a particular variable lies, as measured by one of the numerical measures of center, the most common of which are the mean, median, and mode. Measures of center are a special case of measures of location, numerical summaries that indicate where on a number line a certain characteristic of the variable lies. Examples of measures of location are percentiles and quantiles.

3 See the Appendix for more on scatter plots.



The mean of a variable is simply the average of the valid values taken by
the variable. To find the mean, simply add up all the field values and divide by the
sample size. Here we introduce a bit of notation. The sample mean is denoted as
x̄ (“x-bar”) and is computed as $\bar{x} = \sum x / n$, where $\sum$ (capital sigma, the Greek letter “S,” for “summation”) represents “sum all the values” and $n$ represents the sample size. For example, suppose that we are interested in estimating where the center of the customer service calls variable lies from the churn data set that we will explore in Chapter 3. IBM/SPSS Modeler supplies us with the statistical summaries shown in Figure 2.7. The mean number of customer service calls for this sample of n = 3333 customers is given as x̄ = 1.563. Using the sum and the count statistics, we can verify that

$$\bar{x} = \frac{\sum x}{n} = \frac{5209}{3333} = 1.563$$

For variables that are not extremely skewed, the mean is usually not too far
from the variable center. However, for extremely skewed data sets, the mean becomes
less representative of the variable center. Also, the mean is sensitive to the presence of
outliers. For this reason, analysts sometimes prefer to work with alternative measures
of center, such as the median, defined as the field value in the middle when the field
values are sorted into ascending order. The median is resistant to the presence of
outliers. Other analysts may prefer to use the mode, which represents the field value
occurring with the greatest frequency. The mode may be used with either numerical
or categorical data, but is not always associated with the variable center.
2.6 MEASURES OF CENTER AND SPREAD 25

Customer Service Calls

Statistics
  Count    3333
  Mean     1.563
  Sum      5209.000
  Median   1
  Mode     1

Figure 2.7 Statistical summary of customer service calls.

Note that measures of center do not always concur as to where the center of the
data set lies. In Figure 2.7, the median is 1, which means that half of the customers
made at least one customer service call; the mode is also 1, which means that the
most frequent number of customer service calls was 1. The median and mode agree.
However, the mean is 1.563, which is 56.3% higher than the other measures. This is
due to the mean’s sensitivity to the right-skewness of the data.
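For readers who wish to experiment, here is a minimal Python sketch (using a small hypothetical sample, not the churn data itself) of how the three measures of center are computed; as verified above, the mean is simply the sum divided by the count.

```python
from statistics import mean, median, mode

# Hypothetical stand-in sample; the actual churn data set has n = 3333 records.
calls = [0, 1, 1, 1, 1, 2, 0, 3, 1, 4]

print(mean(calls))    # sum(calls) / len(calls) = 14 / 10 = 1.4
print(median(calls))  # middle of the sorted values: 1
print(mode(calls))    # most frequently occurring value: 1
```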
Measures of location are not sufficient to summarize a variable effectively. In
fact, two variables may have the very same values for the mean, median, and mode,
and yet have different natures. For example, suppose that stock portfolio A and stock
portfolio B contained five stocks each, with the price/earnings (P/E) ratios as shown
in Table 2.3. The portfolios are distinctly different in terms of P/E ratios. Portfolio A
includes one stock that has a very small P/E ratio and another with a rather large P/E
ratio. On the other hand, portfolio B’s P/E ratios are more tightly clustered around the
mean. But despite these differences, the mean, median, and mode of the portfolios’ P/E ratios are precisely the same: the mean P/E ratio is 10, the median is 11, and the mode is 11 for each portfolio.
Clearly, these measures of center do not provide us with a complete picture.
What is missing are measures of spread or measures of variability, which will describe
how spread out the data values are. Portfolio A’s P/E ratios are more spread out than
those of portfolio B, so the measures of variability for portfolio A should be larger
than those of B.
Typical measures of variability include the range (maximum − minimum), the standard deviation, the mean absolute deviation, and the interquartile range.

TABLE 2.3 The two portfolios have the same mean, median, and mode, but are clearly different

Stock Portfolio A    Stock Portfolio B
        1                    7
       11                    8
       11                   11
       11                   11
       16                   13
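As a quick check, the following sketch reproduces the comparison using the P/E ratios in Table 2.3, adding the sample standard deviation (defined next) to preview how the two portfolios differ in spread.

```python
from statistics import mean, median, mode, stdev

portfolio_a = [1, 11, 11, 11, 16]  # P/E ratios from Table 2.3
portfolio_b = [7, 8, 11, 11, 13]

for name, pe in (("A", portfolio_a), ("B", portfolio_b)):
    print(name, mean(pe), median(pe), mode(pe), round(stdev(pe), 2))
# A 10 11 11 5.48   <- same center, larger spread
# B 10 11 11 2.45
```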

The sample standard deviation is perhaps the most widespread measure of variability and is defined by


$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$
Because of the squaring involved, the standard deviation is sensitive to the
presence of outliers, leading analysts to prefer other measures of spread, such as the
mean absolute deviation, in situations involving extreme values.
The standard deviation can be interpreted as the “typical” distance between a
field value and the mean, and most field values lie within two standard deviations of
the mean. From Figure 2.7 we can state that the number of customer service calls
made by most customers lies within 2(1.315) = 2.63 of the mean of 1.563 calls. In
other words, most customers made a number of service calls lying within the interval
(−1.067, 4.193), that is, (0, 4). (This can be verified by examining the histogram of
customer service calls in Figure 3.14 in Chapter 3.)
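The sketch below implements the sample standard deviation formula directly and then reproduces the two-standard-deviation interval using only the summary statistics quoted above (mean 1.563, s = 1.315), since the raw call data are not reproduced here.

```python
import math

def sample_sd(values):
    """Sample standard deviation s, using the n - 1 denominator."""
    m = sum(values) / len(values)
    return math.sqrt(sum((x - m) ** 2 for x in values) / (len(values) - 1))

print(sample_sd([1, 2, 3, 4]))  # 1.2909..., a quick sanity check

# Two-standard-deviation interval for customer service calls,
# from the summary statistics quoted in the text:
m, s = 1.563, 1.315
print((m - 2 * s, m + 2 * s))  # approximately (-1.067, 4.193), i.e., (0, 4) for counts
```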
More information about these statistics may be found in the Appendix. A more
complete discussion of measures of location and variability can be found in any
introductory statistics textbook, such as Larose [2].

2.7 DATA TRANSFORMATION

Variables tend to have ranges that vary greatly from each other. For example, if we
are interested in major league baseball, players’ batting averages will range from zero
to less than 0.400, while the number of home runs hit in a season will range from
zero to around 70. For some data mining algorithms, such differences in the ranges
will lead to a tendency for the variable with greater range to have undue influence
on the results. That is, the greater variability in home runs will dominate the lesser
variability in batting averages.
Therefore, data miners should normalize their numeric variables, in order to
standardize the scale of effect each variable has on the results. Neural networks benefit
from normalization, as do algorithms that make use of distance measures, such as the
k-nearest neighbor algorithm. There are several techniques for normalization, and we
shall examine three of the more prevalent methods. Let X refer to our original field
value, and X∗ refer to the normalized field value.

2.8 MIN-MAX NORMALIZATION

Min-max normalization works by seeing how much greater the field value is than the
minimum value min(X), and scaling this difference by the range. That is

$$X^*_{mm} = \frac{X - \min(X)}{\text{range}(X)} = \frac{X - \min(X)}{\max(X) - \min(X)}$$
The summary statistics for weight are shown in Figure 2.8.

Figure 2.8 Summary statistics for weight.

The minimum weight is 1613 pounds, and the range = max(X) − min(X) = 4997 − 1613 = 3384 pounds. Let us find the min-max normalization for three automobiles weighing 1613 pounds, 3305 pounds, and 4997 pounds, respectively.
• For an ultra-light vehicle, weighing only 1613 pounds (the field minimum), the min-max normalization is
$$X^*_{mm} = \frac{X - \min(X)}{\text{range}(X)} = \frac{1613 - 1613}{3384} = 0$$
Thus, data values that represent the minimum for the variable will have a min-max normalization value of zero.
• The midrange equals the average of the maximum and minimum values in a data set. That is,
$$\text{midrange}(X) = \frac{\max(X) + \min(X)}{2} = \frac{4997 + 1613}{2} = 3305 \text{ pounds}$$
For a “midrange” vehicle (if any), which weighs exactly halfway between the minimum weight and the maximum weight, the min-max normalization is
$$X^*_{mm} = \frac{X - \min(X)}{\text{range}(X)} = \frac{3305 - 1613}{3384} = 0.5$$
So the midrange data value has a min-max normalization value of 0.5.
• The heaviest vehicle has a min-max normalization value of
$$X^*_{mm} = \frac{X - \min(X)}{\text{range}(X)} = \frac{4997 - 1613}{3384} = 1$$
That is, data values representing the field maximum will have a min-max nor-
malization of 1. To summarize, min-max normalization values will range from 0 to 1.
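A minimal sketch of min-max normalization, using the weight summaries from Figure 2.8, follows.

```python
def min_max_normalize(x, x_min, x_max):
    """Scale x into [0, 1]: offset above the minimum, divided by the range."""
    return (x - x_min) / (x_max - x_min)

# Weight summaries from Figure 2.8: min = 1613, max = 4997 (range = 3384).
for w in (1613, 3305, 4997):
    print(w, min_max_normalize(w, 1613, 4997))  # 0.0, 0.5, 1.0
```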

2.9 Z-SCORE STANDARDIZATION

Z-score standardization, which is very widespread in the world of statistical analysis, works by taking the difference between the field value and the field mean value, and scaling this difference by the standard deviation of the field values. That is,
$$Z\text{-score} = \frac{X - \text{mean}(X)}{\text{SD}(X)}$$

Figure 2.8 tells us that mean(weight) = 3005.49 and SD(weight) = 852.49.


• For the vehicle weighing only 1613 pounds, the Z-score standardization is
$$Z\text{-score} = \frac{X - \text{mean}(X)}{\text{SD}(X)} = \frac{1613 - 3005.49}{852.49} \approx -1.63$$
Thus, data values that lie below the mean will have a negative Z-score standardization.
• For an “average” vehicle (if any), with a weight equal to mean(X) = 3005.49 pounds, the Z-score standardization is
$$Z\text{-score} = \frac{X - \text{mean}(X)}{\text{SD}(X)} = \frac{3005.49 - 3005.49}{852.49} = 0$$
That is, values falling exactly on the mean will have a Z-score standardization of zero.
• For the heaviest car, the Z-score standardization is
$$Z\text{-score} = \frac{X - \text{mean}(X)}{\text{SD}(X)} = \frac{4997 - 3005.49}{852.49} \approx 2.34$$
That is, data values that lie above the mean will have a positive Z-score standardization.4
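The same three vehicles can be standardized in a few lines; this sketch uses the mean and standard deviation reported in Figure 2.8.

```python
def z_score(x, mean_x, sd_x):
    """Distance of x from the mean, in units of standard deviations."""
    return (x - mean_x) / sd_x

# From Figure 2.8: mean(weight) = 3005.49, SD(weight) = 852.49.
for w in (1613, 3005.49, 4997):
    print(w, round(z_score(w, 3005.49, 852.49), 2))  # -1.63, 0.0, 2.34
```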

2.10 DECIMAL SCALING

Decimal scaling ensures that every normalized value lies between −1 and 1:
$$X^*_{\text{decimal}} = \frac{X}{10^d}$$
where d represents the number of digits in the data value with the largest absolute value. For the weight data, the largest absolute value is |4997| = 4997, which has d = 4 digits. The decimal scaling for the minimum and maximum weights is
$$\text{Min: } X^*_{\text{decimal}} = \frac{1613}{10^4} = 0.1613 \qquad \text{Max: } X^*_{\text{decimal}} = \frac{4997}{10^4} = 0.4997$$
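A sketch of decimal scaling for the weight field:

```python
def decimal_scale(x, d):
    """Divide x by 10**d, where d is the digit count of the largest |value|."""
    return x / 10 ** d

d = len(str(abs(4997)))        # largest absolute weight is 4997, so d = 4
print(decimal_scale(1613, d))  # 0.1613
print(decimal_scale(4997, d))  # 0.4997
```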

2.11 TRANSFORMATIONS TO ACHIEVE NORMALITY

Some data mining algorithms and statistical methods require that the variables be
normally distributed. The normal distribution is a continuous probability distribution
commonly known as the bell curve, which is symmetric. It is centered at mean 𝜇
(“mu”) and has its spread determined by standard deviation 𝜎 (sigma). Figure 2.9
shows the normal distribution that has mean 𝜇 = 0 and standard deviation 𝜎 = 1,
known as the standard normal distribution Z.
4 Also, for a given Z-score, we may find its associated data value. See the Appendix.

Figure 2.9 Standard normal Z distribution.

It is a common misconception that variables that have had the Z-score standard-
ization applied to them follow the standard normal Z distribution. This is not correct!
It is true that the Z-standardized data will have mean 0 and standard deviation 1 (see
Figure 2.14), but the distribution may still be skewed. Compare the histogram of
the original weight data in Figure 2.10 with the Z-standardized data in Figure 2.11.
Both histograms are right-skewed; in particular, Figure 2.11 is not symmetric, and so
cannot be normally distributed.
We use the following statistic to measure the skewness of a distribution5:
$$\text{Skewness} = \frac{3(\text{mean} - \text{median})}{\text{standard deviation}}$$
For right-skewed data, the mean is greater than the median, and thus the skew-
ness will be positive (Figure 2.12), while for left-skewed data, the mean is smaller

Figure 2.10 Original data.

5 Find more about standard deviations in the Appendix.



Figure 2.11 Z-Standardized data are still right-skewed, not normally distributed.

Figure 2.12 Right-skewed data have positive skewness.

than the median, generating negative values for skewness (Figure 2.13). For perfectly
symmetric data (such as in Figure 2.9) of course, the mean, median, and mode are all
equal, and so the skewness equals zero.
Much real-world data are right-skewed, including most financial data. Left-
skewed data are not as common, but often occur when the data are right-censored,
such as test scores on an easy test, which can get no higher than 100. We use the
statistics for weight and weight_Z shown in Figure 2.14 to calculate the skewness for
these variables.
For weight we have
$$\text{Skewness} = \frac{3(\text{mean} - \text{median})}{\text{standard deviation}} = \frac{3(3005.490 - 2835)}{852.646} = 0.6$$

Figure 2.13 Left-skewed data have negative skewness.

For weight_Z we have
$$\text{Skewness} = \frac{3(\text{mean} - \text{median})}{\text{standard deviation}} = \frac{3(0 - (-0.2))}{1} = 0.6$$
Thus, Z-score standardization has no effect on skewness.
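Both calculations can be verified with a one-line function implementing the skewness statistic above, fed with the summary values reported in Figure 2.14.

```python
def skewness(mean_x, median_x, sd_x):
    """Pearson's skewness statistic: 3 * (mean - median) / standard deviation."""
    return 3 * (mean_x - median_x) / sd_x

print(skewness(3005.490, 2835, 852.646))  # weight:   ~0.6
print(skewness(0, -0.2, 1))               # weight_z: 0.6
```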


To make our data “more normally distributed,” we must first make them symmetric, which means eliminating the skewness. To eliminate skewness, we apply a transformation to the data. Common transformations are the natural log transformation ln(weight), the square root transformation √weight, and the inverse square root transformation 1∕√weight. Application of the square root transformation (Figure 2.15) somewhat reduces the skewness, while applying the ln transformation (Figure 2.16) reduces skewness even further.

Figure 2.14 Statistics for calculating skewness.
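To try these transformations yourself, the sketch below applies each one to a single weight value; in practice you would map them over the whole weight column and then recompute the skewness.

```python
import math

def normality_transforms(w):
    """Return the ln, square root, and inverse square root transformations of w."""
    return math.log(w), math.sqrt(w), 1 / math.sqrt(w)

print(normality_transforms(3005.49))
# approximately (8.008, 54.82, 0.0182) for the mean weight
```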



Figure 2.15 Square root transformation somewhat reduces skewness.

The statistics in Figure 2.17 are used to calculate the reduction in skewness:
$$\text{Skewness}(\sqrt{\text{weight}}) = \frac{3(54.280 - 53.245)}{7.709} \approx 0.40$$
$$\text{Skewness}(\ln(\text{weight})) = \frac{3(7.968 - 7.950)}{0.284} \approx 0.19$$

Finally, we try the inverse square root transformation 1∕√weight, which gives us the distribution in Figure 2.18. The statistics in Figure 2.19 give us
$$\text{Skewness}(1/\sqrt{\text{weight}}) = \frac{3(0.019 - 0.019)}{0.003} = 0$$
which indicates that we have eliminated the skewness and achieved a symmetric distribution.

Figure 2.16 Natural log transformation reduces skewness even further.
