Exploratory Multivariate Analysis by Example Using R, Second Edition
Authors: François Husson, Sébastien Lê, Jérôme Pagès
ISBN: 9781315301860, 1315301865
Edition: Second edition
Year: 2017
Language: English
Chapman & Hall/CRC
Computer Science and Data Analysis Series
Exploratory Multivariate
Analysis by Example Using R
François Husson
Sébastien Lê
Jérôme Pagès
The interface between the computer and statistical sciences is increasing, as each
discipline seeks to harness the power and resources of the other. This series aims to
foster the integration between the computer sciences and statistical, numerical, and
probabilistic methods by publishing a broad range of reference works, textbooks, and
handbooks.
SERIES EDITORS
David Blei, Princeton University
David Madigan, Rutgers University
Marina Meila, University of Washington
Fionn Murtagh, Royal Holloway, University of London
Proposals for the series should be sent directly to one of the series editors above, or submitted to:
Published Titles
Computational Statistics Handbook with MATLAB®, Third Edition
Wendy L. Martinez and Angel R. Martinez
R Graphics
Paul Murrell
This book contains information obtained from authentic and highly regarded sources. Reasonable
efforts have been made to publish reliable data and information, but the author and publisher cannot
assume responsibility for the validity of all materials or the consequences of their use. The authors and
publishers have attempted to trace the copyright holders of all material reproduced in this publication
and apologize to copyright holders if permission to publish in this form has not been obtained. If any
copyright material has not been acknowledged please write and let us know so we may rectify in any
future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information
storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access
www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc.
(CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization
that provides licenses and registration for a variety of users. For organizations that have been granted
a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and
are used only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
Contents
Preface xi
4 Clustering 173
4.1 Data — Issues . . . . . . . . . . . . . . . . . . . . . . . . . . 173
4.2 Formalising the Notion of Similarity . . . . . . . . . . . . . . 177
4.2.1 Similarity between Individuals . . . . . . . . . . . . . 177
4.2.1.1 Distances and Euclidean Distances . . . . . . 177
4.2.1.2 Example of Non-Euclidean Distance . . . . . 178
4.2.1.3 Other Euclidean Distances . . . . . . . . . . 179
4.2.1.4 Similarities and Dissimilarities . . . . . . . . 179
4.2.2 Similarity between Groups of Individuals . . . . . . . 180
4.3 Constructing an Indexed Hierarchy . . . . . . . . . . . . . . 181
4.3.1 Classic Agglomerative Algorithm . . . . . . . . . . . . 181
4.3.2 Hierarchy and Partitions . . . . . . . . . . . . . . . . . 183
4.4 Ward’s Method . . . . . . . . . . . . . . . . . . . . . . . . . 183
4.4.1 Partition Quality . . . . . . . . . . . . . . . . . . . . . 184
4.4.2 Agglomeration According to Inertia . . . . . . . . . . 185
4.4.3 Two Properties of the Agglomeration Criterion . . . . 187
4.4.4 Analysing Hierarchies, Choosing Partitions . . . . . . 188
4.5 Direct Search for Partitions: K-Means Algorithm . . . . . . 189
4.5.1 Data — Issues . . . . . . . . . . . . . . . . . . . . . . 189
4.5.2 Principle . . . . . . . . . . . . . . . . . . . . . . . . . 190
4.5.3 Methodology . . . . . . . . . . . . . . . . . . . . . . . 191
4.6 Partitioning and Hierarchical Clustering . . . . . . . . . . . . 191
4.6.1 Consolidating Partitions . . . . . . . . . . . . . . . . . 192
4.6.2 Mixed Algorithm . . . . . . . . . . . . . . . . . . . . . 192
4.7 Clustering and Principal Component Methods . . . . . . . . 192
4.7.1 Principal Component Methods Prior to AHC . . . . . 193
4.7.2 Simultaneous Analysis of a Principal Component Map
and Hierarchy . . . . . . . . . . . . . . . . . . . . . . . 193
4.8 Clustering and Missing Data . . . . . . . . . . . . . . . . . . 194
4.9 Example: The Temperature Dataset . . . . . . . . . . . . . . 194
4.9.1 Data Description — Issues . . . . . . . . . . . . . . . 194
4.9.2 Analysis Parameters . . . . . . . . . . . . . . . . . . . 195
4.9.3 Implementation of the Analysis . . . . . . . . . . . . . 195
4.10 Example: The Tea Dataset . . . . . . . . . . . . . . . . . . . 199
4.10.1 Data Description — Issues . . . . . . . . . . . . . . . 199
4.10.2 Constructing the AHC . . . . . . . . . . . . . . . . . . 201
4.10.3 Defining the Clusters . . . . . . . . . . . . . . . . . . . 202
4.11 Dividing Quantitative Variables into Classes . . . . . . . . . 204
5 Visualisation 209
5.1 Data — Issues . . . . . . . . . . . . . . . . . . . . . . . . . . 209
5.2 Viewing PCA Data . . . . . . . . . . . . . . . . . . . . . . . 209
5.2.1 Selecting a Subset of Objects — Cloud of Individuals 210
5.2.2 Selecting a Subset of Objects — Cloud of Variables . . 211
5.2.3 Adding Supplementary Information . . . . . . . . . . 212
Appendix 225
A.1 Percentage of Inertia Explained by the First Component or by
the First Plane . . . . . . . . . . . . . . . . . . . . . . . . . . 225
A.2 R Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
A.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . 230
A.2.2 The Rcmdr Package . . . . . . . . . . . . . . . . . . . 234
A.2.3 The FactoMineR Package . . . . . . . . . . . . . . . . 236
Bibliography 243
Index 245
Preface
each chapter on managing missing data, which will enable users to conduct
analyses from incomplete tables more easily.
The authors would like to thank Rebecca Clayton for her help in the translation.
1 Principal Component Analysis (PCA)
1.2 Objectives
The data table can be considered either as a set of rows (individuals) or as a
set of columns (variables), thus raising a number of questions relating to these
different types of objects.
TABLE 1.1
Some Examples of Datasets
Field       Individuals  Variables                     x_ik
Ecology     Rivers       Concentration of pollutants   Concentration of pollutant k in river i
Economics   Years        Economic indicators           Indicator value k for year i
Genetics    Patients     Genes                         Expression of gene k for patient i
Marketing   Brands       Measures of satisfaction      Value of measure k for brand i
Pedology    Soils        Granulometric composition     Content of component k in soil i
Biology     Animals      Measurements                  Measure k for animal i
TABLE 1.2
The Orange Juice Data
                 Odour      Odour       Pulp     Intensity  Acidity  Bitterness  Sweetness
                 intensity  typicality  content  of taste
Pampryl amb.     2.82       2.53        1.66     3.46       3.15     2.97        2.60
Tropicana amb.   2.76       2.82        1.91     3.23       2.55     2.08        3.32
Fruvita fr.      2.83       2.88        4.00     3.45       2.42     1.76        3.38
Joker amb.       2.76       2.59        1.66     3.37       3.05     2.56        2.80
Tropicana fr.    3.20       3.02        3.69     3.12       2.33     1.97        3.34
Pampryl fr.      3.07       2.73        3.34     3.54       3.31     2.63        2.90
FIGURE 1.1
Representation of 40 individuals described by two variables: j and k.
very unusual variables and intermediate variables, which are to some extent
linked to both groups. In the example, each group can be represented by one
single variable as the variables within each group are very strongly correlated.
We refer to these variables as synthetic variables.
FIGURE 1.2
Representation of the relationships between four variables: j, k, l, and m, taken two-by-two (panels A to F show the pairwise scatterplots).
the strongly correlated variables into sets, but even for this reduced number
of variables, grouping them this way is tedious.
TABLE 1.3
Orange Juice Data: Correlation Matrix
                    Odour      Odour       Pulp     Intensity  Acidity  Bitterness  Sweetness
                    intensity  typicality  content  of taste
Odour intensity 1.00 0.58 0.66 −0.27 −0.15 −0.15 0.23
Odour typicality 0.58 1.00 0.77 −0.62 −0.84 −0.88 0.92
Pulp content 0.66 0.77 1.00 −0.02 −0.47 −0.64 0.63
Intensity of taste −0.27 −0.62 −0.02 1.00 0.73 0.51 −0.57
Acidity −0.15 −0.84 −0.47 0.73 1.00 0.91 −0.90
Bitterness −0.15 −0.88 −0.64 0.51 0.91 1.00 −0.98
Sweetness 0.23 0.92 0.63 −0.57 −0.90 −0.98 1.00
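
For reference, a correlation matrix such as Table 1.3 can be reproduced in one line of R. This is only a hedged sketch: the data frame name orange and the assumption that its first seven columns are the sensory descriptors are ours, not the book's.

> round(cor(orange[, 1:7]), 2)   # correlation matrix of the seven sensory descriptors
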
If two individuals have similar values within the table of all K variables, they
are also close in the space R^K. Thus, the study of the data table can be
conducted geometrically by studying the distances between individuals. We
are therefore interested in all of the individuals in R^K, that is, the cloud
of individuals (denoted N_I). Analysing the distances between individuals is
therefore tantamount to studying the shape of the cloud of points. Figure 1.3
illustrates a cloud of points within a space R^K for K = 3.
FIGURE 1.3
Flight of a flock of starlings illustrating a scatterplot in R^K.
The shape of the cloud N_I remains the same even when translated. The
data are also centred, which corresponds to considering x_ik - x̄_k rather than
x_ik. Geometrically, this is tantamount to making the centre of mass of the
cloud G_I (with coordinates x̄_k for k = 1, ..., K) coincide with the origin of the
reference system (see Figure 1.4). Centring presents technical advantages and is
always conducted in PCA.
The operation of reduction (also referred to as standardising), which consists
of considering (x_ik - x̄_k)/s_k rather than x_ik, modifies the shape of the
cloud by harmonising its variability in all the directions of the original vectors
(i.e., the K variables). Geometrically, it means choosing the standard deviation
as the unit of measurement in each direction.
FIGURE 1.4
Scatterplot of the individuals in R^K.
FIGURE 1.5
Two-dimensional representations of fruits: from left to right, an avocado, a
melon, and a banana; each row corresponds to a different representation.
The convention for notation uses mechanical terms: O is the centre of gravity,
OH_i is a vector, and the criterion is the inertia of the projection of N_I. The
criterion, which consists of maximising the variance of the projected points, is
perfectly appropriate.
Remark
If the individuals are weighted with different weights p_i, the maximised criterion is \( \sum_{i=1}^{I} p_i \, OH_i^2 \).
In some rare cases, it might be interesting to search for the best axial
representation of cloud N_I alone. This best axis is obtained in the same way:
find the component u_1 for which \( \sum_{i=1}^{I} OH_i^2 \) is maximum (where H_i is the
projection of i on u_1). It can be shown that plane P contains component u_1 (the
“best” plane contains the “best” component): in this case, these representations
are said to be nested. An illustration of this property is presented in
Figure 1.6. Planets, which are in a three-dimensional space, are traditionally
represented on a component. This component determines their positions as
well as possible in terms of their distances from one another (in terms of inertia
of the projected cloud). We can also represent the planets on a plane according
to the same principle: maximise the inertia of the projected scatterplot
(on the plane). This best plane representation also contains the best axial
representation.
FIGURE 1.6
The best axial representation is nested in the best plane representation of the
solar system (18 February 2008).
Remark
When variables are centred but not standardised, the matrix to be diagonalised is the variance–covariance matrix.
1.3.2.4 Example
The distance between two orange juices is calculated using their seven sensory
descriptors. We decided to standardise the data in order to give each descriptor
equal influence. Figure 1.7 is obtained from the first two components of the
PCA and corresponds to the best plane for representing the cloud of individuals
in terms of projected inertia. The inertia projected on this plane is the sum
of the first two eigenvalues, that is, 86.82% (= 67.77% + 19.05%) of the total inertia
of the cloud of points.
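
In practice such an analysis is run with the FactoMineR package; a minimal sketch, assuming (as in the earlier snippets) a data frame orange whose first seven columns are the sensory descriptors:

> library(FactoMineR)
> res <- PCA(orange[, 1:7], scale.unit = TRUE, graph = FALSE)  # standardised PCA on the active variables
> res$eig                                    # eigenvalues and percentages of inertia (cf. Table 1.5)
> res$ind$coord                              # coordinates of the individuals on the components
> plot(res, axes = c(1, 2), choix = "ind")   # plane representation of the individuals (Figure 1.7)
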
FIGURE 1.7
Orange juice data: plane representation of the scatterplot of individuals.

The first principal component, that is, the principal axis of variability
between the orange juices, separates the two orange juices Tropicana fr. and
Pampryl amb. According to data Table 1.2, we can see that these orange
juices are the most extreme in terms of the descriptors odour typicality and
bitterness: Tropicana fr. is the most typical and the least bitter while Pampryl
amb. is the least typical and the most bitter. The second component, that
is, the property that separates the orange juices most significantly once the
main principal component of variability has been removed, identifies Tropicana
amb., which is the least intense in terms of odour, and Pampryl fr., which is
among the most intense (see Table 1.2).
Reading the data table in this way becomes tedious when the number of individuals
and variables is high. For practical purposes, we will facilitate the characterisation
of the principal components by using the variables more directly.
highest coordinate on component 1, has high values for odour typicality and
sweetness and low values for the variables acidity and bitterness. Similarly, we
can examine the correlations between F_2 and the variables. It may be noted
that these correlations are generally lower (in absolute value) than those with
the first principal component. We will see that this is directly linked to the
percentage of inertia associated with F_2 which is, by construction, lower than
that associated with F_1. The second component can be characterised by the
variables odour intensity and pulp content (see Table 1.4).
TABLE 1.4
Orange Juice Data: Correlation between
Variables and First Two Components
F1 F2
Odour intensity 0.46 0.75
Odour typicality 0.99 0.13
Pulp content 0.72 0.62
Intensity of taste −0.65 0.43
Acidity −0.91 0.35
Bitterness −0.93 0.19
Sweetness 0.95 −0.16
FIGURE 1.8
Orange juice data: visualisation of the correlation coefficients between variables and the principal components F_1 and F_2.
FIGURE 1.9
The scatterplot of the variables N_K in R^I. In the case of a standardised PCA,
the variables k are located within a hypersphere of radius 1.
with ∥k∥ and ∥l∥ the norms of variables k and l, and θ_kl the angle formed by
the vectors representing variables k and l. Since the variables used here are
centred, the norm of a variable is equal to its standard deviation multiplied
by the square root of I, and the scalar product is expressed as follows:

\[ \sum_{i=1}^{I} (x_{ik} - \bar{x}_k)(x_{il} - \bar{x}_l) = I \times s_k \times s_l \times \cos(\theta_{kl}). \]

Dividing both sides by I × s_k × s_l shows that

\[ r(k, l) = \cos(\theta_{kl}). \]
The above expression illustrates that v_s is the new variable most strongly
correlated with all K initial variables (under the constraint of being orthogonal
to the components v_t already found). As a result, v_s can be said to be a synthetic
variable. Here, we encounter the second aspect of the study of the variables
(see Section 1.2.2).
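
A quick numerical check in R of the identity r(k, l) = cos(θ_kl), on a built-in dataset chosen purely for illustration:

> x <- mtcars$mpg - mean(mtcars$mpg)                # centred variable k
> y <- mtcars$wt  - mean(mtcars$wt)                 # centred variable l
> sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))    # cosine of the angle between the two vectors
> cor(mtcars$mpg, mtcars$wt)                        # identical value: the correlation coefficient
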
Remark
When a variable is not standardised, its length is equal to its standard deviation.
In an unstandardised PCA, the criterion can be expressed as follows:

\[ \sum_{k=1}^{K} \left( OH_k^s \right)^2 = \sum_{k=1}^{K} s_k^2 \, r^2(k, v_s). \]

This highlights the fact that, in the case of an unstandardised PCA, each
variable k is assigned a weight equal to its variance s_k^2.
FIGURE 1.10
Projection of the scatterplot of the variables on the main plane of variability.
On the left: visualisation in space R^I; on the right: visualisation of the
projections in the principal plane.
The coordinates of the individuals and the coordinates of the variables on the
component of rank s are linked by the following transition formulas:

\[ F_s(i) = \frac{1}{\sqrt{\lambda_s}} \sum_{k=1}^{K} x_{ik} \, G_s(k), \qquad G_s(k) = \frac{1}{\sqrt{\lambda_s}} \, \frac{1}{I} \sum_{i=1}^{I} x_{ik} \, F_s(i). \]
This result is essential for interpreting the data, and makes PCA a rich
and reliable experimental tool. This may be expressed as follows: individuals
are on the same side as their corresponding variables with high values, and
opposite their corresponding variables with low values. It must be noted that
x_ik are centred and carry both positive and negative values. This is one of the
reasons why individuals can be so far from the variable for which they carry
low values. F_s is referred to as the principal component of rank s; λ_s is the
variance of F_s and its square root the length of F_s in R^I; v_s is known as the
standardised principal component.
The total inertias of both clouds are equal (and equal to K for standardised
PCA) and furthermore, when decomposed component by component, they
are identical. This property is remarkable: if S dimensions are enough to
perfectly represent N_I, the same is true for N_K. In this case, two dimensions
are sufficient. If not, why generate a third synthetic variable which would not
differentiate the individuals at all?
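
The transition formulas, and the fact that G_s(k) is the correlation between variable k and F_s, can be checked numerically with a from-scratch standardised PCA in R. This is a sketch on a built-in dataset; the object names and the use of eigen() are illustrative choices, not the book's code.

> X <- scale(USArrests) * sqrt(nrow(USArrests) / (nrow(USArrests) - 1))   # centre and divide by the population sd
> n <- nrow(X); K <- ncol(X)                                              # n plays the role of I, the number of individuals
> e <- eigen(crossprod(X) / n, symmetric = TRUE)                          # eigendecomposition of the correlation matrix
> lambda <- e$values; U <- e$vectors                                      # eigenvalues (component variances) and unit axes u_s
> Fmat <- X %*% U                                                         # principal components F_s(i)
> Gmat <- t(X) %*% Fmat / (n * matrix(sqrt(lambda), K, K, byrow = TRUE))  # G_s(k) = (1/sqrt(lambda_s)) (1/I) sum_i x_ik F_s(i)
> all.equal(unname(X %*% Gmat / matrix(sqrt(lambda), n, K, byrow = TRUE)), unname(Fmat))  # TRUE: transition back to F
> all.equal(unname(Gmat), unname(cor(X, Fmat)))                           # TRUE: G_s(k) = r(k, F_s)
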
example, we say that the first component expresses three times more
variability than the second; it affects three times more variables but
this formulation is truly precise only when each variable is perfectly
correlated with a component.
Because of the orthogonality of the axes (both in R^K and in R^I), these inertia
percentages can be added together for several components; in the example,
86.82% of the data are represented by the first two components (67.77% +
19.05% = 86.82%).
TABLE 1.5
Orange Juice Data: Decomposition of Variability
per Component
          Eigenvalue   Percentage of variance   Cumulative percentage of variance
Comp 1 4.74 67.77 67.77
Comp 2 1.33 19.05 86.81
Comp 3 0.82 11.71 98.53
Comp 4 0.08 1.20 99.73
Comp 5 0.02 0.27 100.00
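
The same decomposition can be examined in R, assuming res is the object returned by FactoMineR's PCA() in the earlier sketch:

> res$eig                                                  # table of eigenvalues, as in Table 1.5
> barplot(res$eig[, 2], names.arg = rownames(res$eig),     # second column: percentage of variance per component
+         ylab = "Percentage of variance")
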
Let us return to Figure 1.5: the pictures of the fruits on the first line cor-
respond approximately to a projection of the fruits on the plane constructed
by components 2 and 3 of PCA, whereas the images on the second line cor-
respond to a projection on plane 1-2. This is why the fruits are easier to
recognise on the second line: the more variability (i.e., the more information)
collected on plane 1-2 when compared to plane 2-3, the easier it is to grasp
the overall shape of the cloud. Furthermore, the banana is easier to recognise
in plane 1-2 (the second line), as it retrieves greater inertia on plane 1-2. In
concrete terms, as the banana is a longer fruit than a melon, this leads to
more marked differences in inertia between the components. As the melon is
almost spherical, the percentages of inertia associated with each of the three
components are around 33% and therefore the inertia retrieved by plane 1-2
is nearly 66% (as is that retrieved by plane 2-3).
The quality of representation of an individual i on the component of rank s is measured by:

\[ \mathrm{qlt}_s(i) = \frac{\text{projected inertia of } i \text{ on } u_s}{\text{total inertia of } i} = \cos^2 \theta_{is}. \]
Using Pythagoras’ theorem, this indicator is combined for multiple compo-
nents and is most often calculated for a plane.
individuals are the most extreme on the component, and their contributions
are especially useful when the individuals’ weights are different.
Remark
These contributions are combined for multiple individuals.
TABLE 1.7
Orange Juice Data: Contribution of
Individuals to the Construction of the
Components
Dim.1 Dim.2
Pampryl amb. 31.29 0.08
Tropicana amb. 2.76 36.77
Fruvita fr. 13.18 0.02
Joker amb. 12.63 8.69
Tropicana fr. 35.66 4.33
Pampryl fr. 4.48 50.10
TABLE 1.8
Orange Juice Data: Contribution of Variables to the
Construction of the Components
Dim.1 Dim.2
Odour intensity 4.45 42.69
Odour typicality 20.47 1.35
Pulp content 10.98 28.52
Taste intensity 8.90 13.80
Acidity 17.56 9.10
Bitterness 18.42 2.65
Sweetness 19.22 1.89
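
With FactoMineR these indicators are available directly in the object returned by PCA(); a short sketch, again assuming res from the earlier call:

> round(res$ind$cos2, 2)      # quality of representation of each individual on each component
> round(res$ind$contrib, 2)   # contributions of the individuals (cf. Table 1.7)
> round(res$var$contrib, 2)   # contributions of the variables (cf. Table 1.8)
> round(res$var$cos2, 2)      # quality of representation of the variables
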
The coordinate of a supplementary variable k' on the component of rank s is obtained
through the transition formula, restricted to the active individuals:

\[ G_s(k') = \frac{1}{\sqrt{\lambda_s}} \, \frac{1}{I} \sum_{i \in \{\text{active}\}} x_{ik'} \, F_s(i) = r(k', F_s), \]

where {active} refers to the set of active individuals. This coordinate is calculated from the active individuals alone.
In the example, in addition to the sensory descriptors, there are also physicochemical
variables at our disposal (see Table 1.9). However, our stance remains
unchanged, namely, to describe the orange juices based on their sensory
profiles. This problem can be enriched using the supplementary variables, since
we can now link the sensory dimensions to the physicochemical variables.
TABLE 1.9
Orange Juice Data: Supplementary Variables
                 Glucose  Fructose  Saccharose  Sweetening power  pH    Citric acid  Vitamin C
Pampryl amb. 25.32 27.36 36.45 89.95 3.59 0.84 43.44
Tropicana amb. 17.33 20.00 44.15 82.55 3.89 0.67 32.70
Fruvita fr. 23.65 25.65 52.12 102.22 3.85 0.69 37.00
Joker amb. 32.42 34.54 22.92 90.71 3.60 0.95 36.60
Tropicana fr. 22.70 25.32 45.80 94.87 3.82 0.71 39.50
Pampryl fr. 27.16 29.48 38.94 96.51 3.68 0.74 27.00
The correlations circle (Figure 1.11) represents both the active and supplementary
variables. The main component of variability opposes the orange
juices perceived as acidic and bitter, not very sweet and not very typical, with the
orange juices perceived as sweet and typical, not very acidic and not very bitter.
The analysis of this sensory perception is reinforced by the variables pH and
saccharose. Indeed, these two variables are positively correlated with the first
component and lie on the side of the orange juices perceived as sweet and
slightly acidic (a high pH indicates low acidity). One also finds the reaction
known as “saccharose inversion” (or hydrolysis): the saccharose breaks
down into glucose and fructose in an acidic environment (the acidic orange
juices thus contain more fructose and glucose, and less saccharose, than the
average).
FIGURE 1.11
Orange juice data: representation of the active and supplementary variables.
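
In FactoMineR, supplementary quantitative and categorical variables are declared directly in the call to PCA(). A hedged sketch, in which the column layout is an assumption (sensory descriptors in columns 1 to 7, the physicochemical variables of Table 1.9 in columns 8 to 14, the two categorical variables of Table 1.10 in columns 15 and 16):

> res <- PCA(orange, scale.unit = TRUE, quanti.sup = 8:14, quali.sup = 15:16, graph = FALSE)
> plot(res, choix = "var")                   # correlation circle with active and supplementary variables (Figure 1.11)
> plot(res, choix = "ind", habillage = 16)   # individuals coloured according to the categorical variable Origin (Figure 1.12)
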
Remark
When using PCA to explore data prior to a multiple regression, it is advisable
to choose the explanatory variables for the regression model as active variables
for PCA, and to project the variable to be explained (the dependent variable)
as a supplementary variable. This gives some idea of the relationships between
explanatory variables and thus of the need to select explanatory variables.
This also gives us an idea of the quality of the regression: if the dependent
variable is well projected, the regression model should fit well.
TABLE 1.10
Orange Juice Data: Supplementary
Categorical Variables
                 Way of preserving   Origin
Pampryl amb. Ambient Other
Tropicana amb. Ambient Florida
Fruvita fr. Fresh Florida
Joker amb. Ambient Other
Tropicana fr. Fresh Florida
Pampryl fr. Fresh Other
FIGURE 1.12
Orange juice data: plane representation of the scatterplot of individuals with
a supplementary categorical variable.
The coordinate of a supplementary individual i' on the component of rank s is obtained
through the transition formula:

\[ F_s(i') = \frac{1}{\sqrt{\lambda_s}} \sum_{k=1}^{K} x_{i'k} \, G_s(k). \]

Note that centring and standardising (if any) are conducted with respect to
the averages and the standard deviations calculated from the active individuals
only. Moreover, the coordinate of i' is calculated from the active variables
alone. Therefore, it is not necessary to have the values of the supplementary
individuals for the supplementary variables.
Comment
A category of a supplementary categorical variable can be regarded as a supplementary
individual which, for each active variable, takes the average calculated
from all of the individuals possessing that category.
equal for each category). The correlation coefficients are sorted according to
the p-values in descending order for the positive coefficients and in ascending
order for the negative coefficients.
These tips for interpreting such data are particularly useful for understanding
those dimensions with a high number of variables.
The data used here comprise only a few variables. We shall nonetheless give the
output of the automatic description procedure for the first component as
an example. The variables which best characterise component 1 are odour
typicality, sweetness, bitterness, and acidity (see Table 1.11).
TABLE 1.11
Orange Juice Data: Description of the First Dimension
by the Quantitative Variables
Correlation p-value
Odour typicality 0.9854 0.0003
Sweetness 0.9549 0.0030
pH 0.8797 0.0208
Acidity −0.9127 0.0111
Bitterness −0.9348 0.0062
TABLE 1.12
Orange Juice Data: Description of the First
Dimension by the Categorical Variables and
the Categories of These Categorical Variables
$Dim.1$quali
R2 p-value
Origin 0.8458 0.0094
$Dim.1$category
Estimate p-value
Florida 2.0031 0.0094
Other −2.0031 0.0094
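
Tables 1.11 and 1.12 correspond to the output of FactoMineR's automatic description of the dimensions; a minimal sketch, assuming res is the PCA result including the supplementary variables from the previous snippet:

> desc <- dimdesc(res, axes = 1:2, proba = 0.05)
> desc$Dim.1$quanti     # quantitative variables sorted by their correlation with component 1 (cf. Table 1.11)
> desc$Dim.1$quali      # categorical variables linked to component 1 (cf. Table 1.12)
> desc$Dim.1$category   # categories whose coordinate differs significantly from the overall mean
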