Discovering Knowledge in Data (Larose, 2014), Chapter 2

Figure 2.5 Histogram of vehicle weights: can you find the outlier?

There appears to be one lonely vehicle in the extreme left tail of the distribution,
with a vehicle weight in the hundreds of pounds rather than in the thousands. Further
investigation (not shown) tells us that the minimum weight is 192.5 pounds, which
is undoubtedly our little outlier in the lower tail. As 192.5 pounds is rather light for
an automobile, we would tend to doubt the validity of this information.
We can surmise that perhaps the weight was originally 1925 pounds, with the decimal point misplaced somewhere along the line. We cannot be certain, however, and
further investigation into the data sources is called for.
Sometimes two-dimensional scatter plots3 can help to reveal outliers in more than one variable. Figure 2.6, a scatter plot of mpg against weightlbs, seems to have netted two outliers.

Figure 2.6 Scatter plot of mpg against weightlbs shows two outliers.
Most of the data points cluster together along the horizontal axis, except for two
outliers. The one on the left is the same vehicle we identified in Figure 2.5, weighing
only 192.5 pounds. The outlier near the top is something new: a car that gets over
500 miles per gallon! Clearly, unless this vehicle runs on dilithium crystals, we are
looking at a data entry error.
Note that the 192.5 pound vehicle is an outlier with respect to weight but
not with respect to mileage. Similarly, the 500-mpg car is an outlier with respect
to mileage but not with respect to weight. Thus, a record may be an outlier in a
particular dimension but not in another. We shall examine numeric methods for
identifying outliers, but we need to pick up a few tools first.

2.6 MEASURES OF CENTER AND SPREAD

Suppose that we are interested in estimating where the center of a particular variable lies, as measured by one of the numerical measures of center, the most common of which are the mean, median, and mode. Measures of center are a special case of measures of location, numerical summaries that indicate where on a number line a certain characteristic of the variable lies. Examples of measures of location are percentiles and quantiles.

3 See the Appendix for more on scatter plots.



The mean of a variable is simply the average of the valid values taken by
the variable. To find the mean, simply add up all the field values and divide by the
sample size. Here we introduce a bit of notation. The sample mean is denoted as
x̄ (“x-bar”) and is computed as $\bar{x} = \sum x / n$, where $\sum$ (capital sigma, the Greek letter “S,” for “summation”) represents “sum all the values” and $n$ represents the sample size. For example, suppose that we are interested in estimating where the center of the customer service calls variable lies from the churn data set that we will explore in Chapter 3. IBM/SPSS Modeler supplies us with the statistical summaries shown in Figure 2.7. The mean number of customer service calls for this sample of n = 3333 customers is given as x̄ = 1.563. Using the sum and the count statistics, we can verify that

$$\bar{x} = \frac{\sum x}{n} = \frac{5209}{3333} = 1.563$$

For variables that are not extremely skewed, the mean is usually not too far
from the variable center. However, for extremely skewed data sets, the mean becomes
less representative of the variable center. Also, the mean is sensitive to the presence of
outliers. For this reason, analysts sometimes prefer to work with alternative measures
of center, such as the median, defined as the field value in the middle when the field
values are sorted into ascending order. The median is resistant to the presence of
outliers. Other analysts may prefer to use the mode, which represents the field value
occurring with the greatest frequency. The mode may be used with either numerical
or categorical data, but is not always associated with the variable center.
2.6 MEASURES OF CENTER AND SPREAD 25

Customer Service Calls

Statistics
  Count    3333
  Mean     1.563
  Sum      5209.000
  Median   1
  Mode     1

Figure 2.7 Statistical summary of customer service calls.

Note that measures of center do not always concur as to where the center of the
data set lies. In Figure 2.7, the median is 1, which means that half of the customers
made at least one customer service call; the mode is also 1, which means that the
most frequent number of customer service calls was 1. The median and mode agree.
However, the mean is 1.563, which is 56.3% higher than the other measures. This is
due to the mean’s sensitivity to the right-skewness of the data.
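For readers who wish to experiment, here is a minimal Python sketch (using a small hypothetical sample, not the churn data itself) of how the three measures of center are computed; as verified above, the mean is simply the sum divided by the count.

```python
from statistics import mean, median, mode

# Hypothetical stand-in sample; the actual churn data set has n = 3333 records.
calls = [0, 1, 1, 1, 1, 2, 0, 3, 1, 4]

print(mean(calls))    # sum(calls) / len(calls) = 14 / 10 = 1.4
print(median(calls))  # middle of the sorted values: 1
print(mode(calls))    # most frequently occurring value: 1
```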
Measures of location are not sufficient to summarize a variable effectively. In
fact, two variables may have the very same values for the mean, median, and mode,
and yet have different natures. For example, suppose that stock portfolio A and stock
portfolio B contained five stocks each, with the price/earnings (P/E) ratios as shown
in Table 2.3. The portfolios are distinctly different in terms of P/E ratios. Portfolio A
includes one stock that has a very small P/E ratio and another with a rather large P/E
ratio. On the other hand, portfolio B’s P/E ratios are more tightly clustered around the
mean. But despite these differences, the mean, median, and mode of the portfolios’ P/E ratios are precisely the same: the mean P/E ratio is 10, the median is 11, and the mode is 11 for each portfolio.
Clearly, these measures of center do not provide us with a complete picture.
What is missing are measures of spread or measures of variability, which will describe
how spread out the data values are. Portfolio A’s P/E ratios are more spread out than
those of portfolio B, so the measures of variability for portfolio A should be larger
than those of B.
Typical measures of variability include the range (maximum − minimum), the standard deviation, the mean absolute deviation, and the interquartile range.

TABLE 2.3 The two portfolios have the same mean, median, and mode, but are clearly different

Stock Portfolio A    Stock Portfolio B
        1                    7
       11                    8
       11                   11
       11                   11
       16                   13
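As a quick check, the following sketch reproduces the comparison using the P/E ratios in Table 2.3, adding the sample standard deviation (defined next) to preview how the two portfolios differ in spread.

```python
from statistics import mean, median, mode, stdev

portfolio_a = [1, 11, 11, 11, 16]  # P/E ratios from Table 2.3
portfolio_b = [7, 8, 11, 11, 13]

for name, pe in (("A", portfolio_a), ("B", portfolio_b)):
    print(name, mean(pe), median(pe), mode(pe), round(stdev(pe), 2))
# A 10 11 11 5.48   <- same center, larger spread
# B 10 11 11 2.45
```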

The sample standard deviation is perhaps the most widespread measure of variability and is defined by


$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$
Because of the squaring involved, the standard deviation is sensitive to the
presence of outliers, leading analysts to prefer other measures of spread, such as the
mean absolute deviation, in situations involving extreme values.
The standard deviation can be interpreted as the “typical” distance between a
field value and the mean, and most field values lie within two standard deviations of
the mean. From Figure 2.7 we can state that the number of customer service calls
made by most customers lies within 2(1.315) = 2.63 of the mean of 1.563 calls. In
other words, most customers made a number of service calls lying within the interval
(−1.067, 4.193), that is, (0, 4). (This can be verified by examining the histogram of
customer service calls in Figure 3.14 in Chapter 3.)
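The sketch below implements the sample standard deviation formula directly and then reproduces the two-standard-deviation interval using only the summary statistics quoted above (mean 1.563, s = 1.315), since the raw call data are not reproduced here.

```python
import math

def sample_sd(values):
    """Sample standard deviation s, using the n - 1 denominator."""
    m = sum(values) / len(values)
    return math.sqrt(sum((x - m) ** 2 for x in values) / (len(values) - 1))

print(sample_sd([1, 2, 3, 4]))  # 1.2909..., a quick sanity check

# Two-standard-deviation interval for customer service calls,
# from the summary statistics quoted in the text:
m, s = 1.563, 1.315
print((m - 2 * s, m + 2 * s))  # approximately (-1.067, 4.193), i.e., (0, 4) for counts
```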
More information about these statistics may be found in the Appendix. A more
complete discussion of measures of location and variability can be found in any
introductory statistics textbook, such as Larose [2].

2.7 DATA TRANSFORMATION

Variables tend to have ranges that vary greatly from each other. For example, if we
are interested in major league baseball, players’ batting averages will range from zero
to less than 0.400, while the number of home runs hit in a season will range from
zero to around 70. For some data mining algorithms, such differences in the ranges
will lead to a tendency for the variable with greater range to have undue influence
on the results. That is, the greater variability in home runs will dominate the lesser
variability in batting averages.
Therefore, data miners should normalize their numeric variables, in order to
standardize the scale of effect each variable has on the results. Neural networks benefit
from normalization, as do algorithms that make use of distance measures, such as the
k-nearest neighbor algorithm. There are several techniques for normalization, and we
shall examine three of the more prevalent methods. Let X refer to our original field
value, and X∗ refer to the normalized field value.

2.8 MIN-MAX NORMALIZATION

Min-max normalization works by seeing how much greater the field value is than the
minimum value min(X), and scaling this difference by the range. That is

$$X^*_{mm} = \frac{X - \min(X)}{\text{range}(X)} = \frac{X - \min(X)}{\max(X) - \min(X)}$$
The summary statistics for weight are shown in Figure 2.8.

Figure 2.8 Summary statistics for weight.

The minimum weight is 1613 pounds, and the range = max(X) − min(X) = 4997 − 1613 = 3384 pounds. Let us find the min-max normalization for three automobiles weighing 1613 pounds, 3305 pounds, and 4997 pounds, respectively.
• For an ultra-light vehicle, weighing only 1613 pounds (the field minimum), the min-max normalization is
$$X^*_{mm} = \frac{X - \min(X)}{\text{range}(X)} = \frac{1613 - 1613}{3384} = 0$$
Thus, data values that represent the minimum for the variable will have a min-max normalization value of zero.
• The midrange equals the average of the maximum and minimum values in a data set. That is,
$$\text{midrange}(X) = \frac{\max(X) + \min(X)}{2} = \frac{4997 + 1613}{2} = 3305 \text{ pounds}$$
For a “midrange” vehicle (if any), which weighs exactly halfway between the minimum weight and the maximum weight, the min-max normalization is
$$X^*_{mm} = \frac{X - \min(X)}{\text{range}(X)} = \frac{3305 - 1613}{3384} = 0.5$$
So the midrange data value has a min-max normalization value of 0.5.
• The heaviest vehicle has a min-max normalization value of
$$X^*_{mm} = \frac{X - \min(X)}{\text{range}(X)} = \frac{4997 - 1613}{3384} = 1$$
That is, data values representing the field maximum will have a min-max nor-
malization of 1. To summarize, min-max normalization values will range from 0 to 1.
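A minimal sketch of min-max normalization, using the weight summaries from Figure 2.8, follows.

```python
def min_max_normalize(x, x_min, x_max):
    """Scale x into [0, 1]: offset above the minimum, divided by the range."""
    return (x - x_min) / (x_max - x_min)

# Weight summaries from Figure 2.8: min = 1613, max = 4997 (range = 3384).
for w in (1613, 3305, 4997):
    print(w, min_max_normalize(w, 1613, 4997))  # 0.0, 0.5, 1.0
```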

2.9 Z-SCORE STANDARDIZATION

Z-score standardization, which is very widespread in the world of statistical analysis, works by taking the difference between the field value and the field mean value, and scaling this difference by the standard deviation of the field values. That is,
$$Z\text{-score} = \frac{X - \text{mean}(X)}{\text{SD}(X)}$$

Figure 2.8 tells us that mean(weight) = 3005.49 and SD(weight) = 852.49.


• For the vehicle weighing only 1613 pounds, the Z-score standardization is
$$Z\text{-score} = \frac{X - \text{mean}(X)}{\text{SD}(X)} = \frac{1613 - 3005.49}{852.49} \approx -1.63$$
Thus, data values that lie below the mean will have a negative Z-score standardization.
• For an “average” vehicle (if any), with a weight equal to mean(X) = 3005.49 pounds, the Z-score standardization is
$$Z\text{-score} = \frac{X - \text{mean}(X)}{\text{SD}(X)} = \frac{3005.49 - 3005.49}{852.49} = 0$$
That is, values falling exactly on the mean will have a Z-score standardization of zero.
• For the heaviest car, the Z-score standardization is
$$Z\text{-score} = \frac{X - \text{mean}(X)}{\text{SD}(X)} = \frac{4997 - 3005.49}{852.49} \approx 2.34$$
That is, data values that lie above the mean will have a positive Z-score standardization.4
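The same three vehicles can be standardized in a few lines; this sketch uses the mean and standard deviation reported in Figure 2.8.

```python
def z_score(x, mean_x, sd_x):
    """Distance of x from the mean, in units of standard deviations."""
    return (x - mean_x) / sd_x

# From Figure 2.8: mean(weight) = 3005.49, SD(weight) = 852.49.
for w in (1613, 3005.49, 4997):
    print(w, round(z_score(w, 3005.49, 852.49), 2))  # -1.63, 0.0, 2.34
```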

2.10 DECIMAL SCALING

Decimal scaling ensures that every normalized value lies between −1 and 1:
$$X^*_{\text{decimal}} = \frac{X}{10^d}$$
where d represents the number of digits in the data value with the largest absolute value. For the weight data, the largest absolute value is |4997| = 4997, which has d = 4 digits. The decimal scaling for the minimum and maximum weights is
$$\text{Min: } X^*_{\text{decimal}} = \frac{1613}{10^4} = 0.1613 \qquad \text{Max: } X^*_{\text{decimal}} = \frac{4997}{10^4} = 0.4997$$
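A sketch of decimal scaling for the weight field:

```python
def decimal_scale(x, d):
    """Divide x by 10**d, where d is the digit count of the largest |value|."""
    return x / 10 ** d

d = len(str(abs(4997)))        # largest absolute weight is 4997, so d = 4
print(decimal_scale(1613, d))  # 0.1613
print(decimal_scale(4997, d))  # 0.4997
```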

2.11 TRANSFORMATIONS TO ACHIEVE NORMALITY

Some data mining algorithms and statistical methods require that the variables be
normally distributed. The normal distribution is a continuous probability distribution
commonly known as the bell curve, which is symmetric. It is centered at mean 𝜇
(“mu”) and has its spread determined by standard deviation 𝜎 (sigma). Figure 2.9
shows the normal distribution that has mean 𝜇 = 0 and standard deviation 𝜎 = 1,
known as the standard normal distribution Z.
4 Also, for a given Z-score, we may find its associated data value. See the Appendix.

Figure 2.9 Standard normal Z distribution.

It is a common misconception that variables that have had the Z-score standard-
ization applied to them follow the standard normal Z distribution. This is not correct!
It is true that the Z-standardized data will have mean 0 and standard deviation 1 (see
Figure 2.14), but the distribution may still be skewed. Compare the histogram of
the original weight data in Figure 2.10 with the Z-standardized data in Figure 2.11.
Both histograms are right-skewed; in particular, Figure 2.11 is not symmetric, and so
cannot be normally distributed.
We use the following statistic to measure the skewness of a distribution5:
$$\text{Skewness} = \frac{3(\text{mean} - \text{median})}{\text{standard deviation}}$$
For right-skewed data, the mean is greater than the median, and thus the skew-
ness will be positive (Figure 2.12), while for left-skewed data, the mean is smaller

Figure 2.10 Original data.

5 Find more about standard deviations in the Appendix.



Figure 2.11 Z-Standardized data are still right-skewed, not normally distributed.

Figure 2.12 Right-skewed data have positive skewness.

than the median, generating negative values for skewness (Figure 2.13). For perfectly
symmetric data (such as in Figure 2.9) of course, the mean, median, and mode are all
equal, and so the skewness equals zero.
Much real-world data are right-skewed, including most financial data. Left-
skewed data are not as common, but often occur when the data are right-censored,
such as test scores on an easy test, which can get no higher than 100. We use the
statistics for weight and weight_Z shown in Figure 2.14 to calculate the skewness for
these variables.
For weight we have
$$\text{Skewness} = \frac{3(\text{mean} - \text{median})}{\text{standard deviation}} = \frac{3(3005.490 - 2835)}{852.646} = 0.6$$

Figure 2.13 Left-skewed data have negative skewness.

For weight_Z we have
$$\text{Skewness} = \frac{3(\text{mean} - \text{median})}{\text{standard deviation}} = \frac{3(0 - (-0.2))}{1} = 0.6$$
Thus, Z-score standardization has no effect on skewness.
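Both calculations can be verified with a one-line function implementing the skewness statistic above, fed with the summary values reported in Figure 2.14.

```python
def skewness(mean_x, median_x, sd_x):
    """Pearson's skewness statistic: 3 * (mean - median) / standard deviation."""
    return 3 * (mean_x - median_x) / sd_x

print(skewness(3005.490, 2835, 852.646))  # weight:   ~0.6
print(skewness(0, -0.2, 1))               # weight_z: 0.6
```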


To make our data “more normally distributed,” we must first make them symmetric, which means eliminating the skewness. To eliminate skewness, we apply a transformation to the data. Common transformations are the natural log transformation ln(weight), the square root transformation √weight, and the inverse square root transformation 1∕√weight. Application of the square root transformation (Figure 2.15) somewhat reduces the skewness, while applying the ln transformation (Figure 2.16) reduces skewness even further.

Figure 2.14 Statistics for calculating skewness.
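To try these transformations yourself, the sketch below applies each one to a single weight value; in practice you would map them over the whole weight column and then recompute the skewness.

```python
import math

def normality_transforms(w):
    """Return the ln, square root, and inverse square root transformations of w."""
    return math.log(w), math.sqrt(w), 1 / math.sqrt(w)

print(normality_transforms(3005.49))
# approximately (8.008, 54.82, 0.0182) for the mean weight
```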



Figure 2.15 Square root transformation somewhat reduces skewness.

The statistics in Figure 2.17 are used to calculate the reduction in skewness:
$$\text{Skewness}(\sqrt{\text{weight}}) = \frac{3(54.280 - 53.245)}{7.709} \approx 0.40$$
$$\text{Skewness}(\ln(\text{weight})) = \frac{3(7.968 - 7.950)}{0.284} \approx 0.19$$

Finally, we try the inverse square root transformation 1∕√weight, which gives us the distribution in Figure 2.18. The statistics in Figure 2.19 give us
$$\text{Skewness}(1/\sqrt{\text{weight}}) = \frac{3(0.019 - 0.019)}{0.003} = 0$$
which indicates that we have eliminated the skewness and achieved a symmetric distribution.

Figure 2.16 Natural log transformation reduces skewness even further.
