Chapman & Hall/CRC
Computer Science and Data Analysis Series

Exploratory Multivariate
Analysis by Example Using R

François Husson
Sébastien Lê
Jérôme Pagès

Boca Raton London New York

CRC Press is an imprint of the


Taylor & Francis Group, an informa business
A CHAPMAN & HALL BOOK
Chapman & Hall/CRC
Computer Science and Data Analysis Series

The interface between the computer and statistical sciences is increasing, as each
discipline seeks to harness the power and resources of the other. This series aims to
foster the integration between the computer sciences and statistical, numerical, and
probabilistic methods by publishing a broad range of reference works, textbooks, and
handbooks.

SERIES EDITORS
David Blei, Princeton University
David Madigan, Rutgers University
Marina Meila, University of Washington
Fionn Murtagh, Royal Holloway, University of London

Proposals for the series should be sent directly to one of the series editors above, or submitted to:

Chapman & Hall/CRC


Taylor and Francis Group
3 Park Square, Milton Park
Abingdon, OX14 4RN, UK

Published Titles

Semisupervised Learning for Computational Linguistics


Steven Abney

Visualization and Verbalization of Data


Jörg Blasius and Michael Greenacre

Design and Modeling for Computer Experiments


Kai-Tai Fang, Runze Li, and Agus Sudjianto

Microarray Image Analysis: An Algorithmic Approach


Karl Fraser, Zidong Wang, and Xiaohui Liu

R Programming for Bioinformatics


Robert Gentleman

Exploratory Multivariate Analysis by Example Using R


François Husson, Sébastien Lê, and Jérôme Pagès

Bayesian Artificial Intelligence, Second Edition


Kevin B. Korb and Ann E. Nicholson
Published Titles cont.

Computational Statistics Handbook with MATLAB®, Third Edition
Wendy L. Martinez and Angel R. Martinez

Exploratory Data Analysis with MATLAB®, Third Edition
Wendy L. Martinez, Angel R. Martinez, and Jeffrey L. Solka

Statistics in MATLAB®: A Primer


Wendy L. Martinez and MoonJung Cho

Clustering for Data Mining: A Data Recovery Approach, Second Edition


Boris Mirkin

Introduction to Machine Learning and Bioinformatics


Sushmita Mitra, Sujay Datta, Theodore Perkins, and George Michailidis

Introduction to Data Technologies


Paul Murrell

R Graphics
Paul Murrell

Data Science Foundations: Geometry and Topology of Complex Hierarchic Systems and Big Data Analytics
Fionn Murtagh

Correspondence Analysis and Data Coding with Java and R


Fionn Murtagh

Pattern Recognition Algorithms for Data Mining


Sankar K. Pal and Pabitra Mitra

Statistical Computing with R


Maria L. Rizzo

Statistical Learning and Data Science


Mireille Gettler Summa, Léon Bottou, Bernard Goldfarb, Fionn Murtagh,
Catherine Pardoux, and Myriam Touati

Music Data Analysis: Foundations and Applications


Claus Weihs, Dietmar Jannach, Igor Vatolkin, and Günter Rudolph

Foundations of Statistical Algorithms: With References to R Packages


Claus Weihs, Olaf Mersmann, and Uwe Ligges
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2017 by Taylor & Francis Group, LLC


CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works

Printed on acid-free paper


Version Date: 20170331

International Standard Book Number-13: 978-1-138-19634-6 (Hardback)

This book contains information obtained from authentic and highly regarded sources. Reasonable
efforts have been made to publish reliable data and information, but the author and publisher cannot
assume responsibility for the validity of all materials or the consequences of their use. The authors and
publishers have attempted to trace the copyright holders of all material reproduced in this publication
and apologize to copyright holders if permission to publish in this form has not been obtained. If any
copyright material has not been acknowledged please write and let us know so we may rectify in any
future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information
storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access
www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc.
(CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization
that provides licenses and registration for a variety of users. For organizations that have been granted
a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and
are used only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
Contents

Preface

1 Principal Component Analysis (PCA)
  1.1 Data — Notation — Examples
  1.2 Objectives
    1.2.1 Studying Individuals
    1.2.2 Studying Variables
    1.2.3 Relationships between the Two Studies
  1.3 Studying Individuals
    1.3.1 The Cloud of Individuals
    1.3.2 Fitting the Cloud of Individuals
      1.3.2.1 Best Plane Representation of NI
      1.3.2.2 Sequence of Axes for Representing NI
      1.3.2.3 How Are the Components Obtained?
      1.3.2.4 Example
    1.3.3 Representation of the Variables as an Aid for Interpreting the Cloud of Individuals
  1.4 Studying Variables
    1.4.1 The Cloud of Variables
    1.4.2 Fitting the Cloud of Variables
  1.5 Relationships between the Two Representations NI and NK
  1.6 Interpreting the Data
    1.6.1 Numerical Indicators
      1.6.1.1 Percentage of Inertia Associated with a Component
      1.6.1.2 Quality of Representation of an Individual or Variable
      1.6.1.3 Detecting Outliers
      1.6.1.4 Contribution of an Individual or Variable to the Construction of a Component
    1.6.2 Supplementary Elements
      1.6.2.1 Representing Supplementary Quantitative Variables
      1.6.2.2 Representing Supplementary Categorical Variables
      1.6.2.3 Representing Supplementary Individuals
    1.6.3 Automatic Description of the Components
  1.7 Implementation with FactoMineR
  1.8 Additional Results
    1.8.1 Testing the Significance of the Components
    1.8.2 Variables: Loadings versus Correlations
    1.8.3 Simultaneous Representation: Biplots
    1.8.4 Missing Values
    1.8.5 Large Datasets
    1.8.6 Varimax Rotation
  1.9 Example: The Decathlon Dataset
    1.9.1 Data Description — Issues
    1.9.2 Analysis Parameters
      1.9.2.1 Choice of Active Elements
      1.9.2.2 Should the Variables Be Standardised?
    1.9.3 Implementation of the Analysis
      1.9.3.1 Choosing the Number of Dimensions to Examine
      1.9.3.2 Studying the Cloud of Individuals
      1.9.3.3 Studying the Cloud of Variables
      1.9.3.4 Joint Analysis of the Cloud of Individuals and the Cloud of Variables
      1.9.3.5 Comments on the Data
  1.10 Example: The Temperature Dataset
    1.10.1 Data Description — Issues
    1.10.2 Analysis Parameters
      1.10.2.1 Choice of Active Elements
      1.10.2.2 Should the Variables Be Standardised?
    1.10.3 Implementation of the Analysis
  1.11 Example of Genomic Data: The Chicken Dataset
    1.11.1 Data Description — Issues
    1.11.2 Analysis Parameters
    1.11.3 Implementation of the Analysis

2 Correspondence Analysis (CA)
  2.1 Data — Notation — Examples
  2.2 Objectives and the Independence Model
    2.2.1 Objectives
    2.2.2 Independence Model and χ2 Test
    2.2.3 The Independence Model and CA
  2.3 Fitting the Clouds
    2.3.1 Clouds of Row Profiles
    2.3.2 Clouds of Column Profiles
    2.3.3 Fitting Clouds NI and NJ
    2.3.4 Example: Women's Attitudes to Women's Work in France in 1970
      2.3.4.1 Column Representation (Mother's Activity)
      2.3.4.2 Row Representation (Partner's Work)
    2.3.5 Superimposed Representation of Both Rows and Columns
  2.4 Interpreting the Data
    2.4.1 Inertias Associated with the Dimensions (Eigenvalues)
    2.4.2 Contribution of Points to a Dimension's Inertia
    2.4.3 Representation Quality of Points on a Dimension or Plane
    2.4.4 Distance and Inertia in the Initial Space
  2.5 Supplementary Elements (= Illustrative)
  2.6 Implementation with FactoMineR
  2.7 CA and Textual Data Processing
  2.8 Example: The Olympic Games Dataset
    2.8.1 Data Description — Issues
    2.8.2 Implementation of the Analysis
      2.8.2.1 Choosing the Number of Dimensions to Examine
      2.8.2.2 Studying the Superimposed Representation
      2.8.2.3 Interpreting the Results
      2.8.2.4 Comments on the Data
  2.9 Example: The White Wines Dataset
    2.9.1 Data Description — Issues
    2.9.2 Margins
    2.9.3 Inertia
    2.9.4 Representation on the First Plane
  2.10 Example: The Causes of Mortality Dataset
    2.10.1 Data Description — Issues
    2.10.2 Margins
    2.10.3 Inertia
    2.10.4 First Dimension
    2.10.5 Plane 2-3
    2.10.6 Projecting the Supplementary Elements
    2.10.7 Conclusion

3 Multiple Correspondence Analysis (MCA)
  3.1 Data — Notation — Examples
  3.2 Objectives
    3.2.1 Studying Individuals
    3.2.2 Studying the Variables and Categories
  3.3 Defining Distances between Individuals and Distances between Categories
    3.3.1 Distances between the Individuals
    3.3.2 Distances between the Categories
  3.4 CA on the Indicator Matrix
    3.4.1 Relationship between MCA and CA
    3.4.2 The Cloud of Individuals
    3.4.3 The Cloud of Variables
    3.4.4 The Cloud of Categories
    3.4.5 Transition Relations
  3.5 Interpreting the Data
    3.5.1 Numerical Indicators
      3.5.1.1 Percentage of Inertia Associated with a Component
      3.5.1.2 Contribution and Representation Quality of an Individual or Category
    3.5.2 Supplementary Elements
    3.5.3 Automatic Description of the Components
  3.6 Implementation with FactoMineR
  3.7 Addendum
    3.7.1 Analysing a Survey
      3.7.1.1 Designing a Questionnaire: Choice of Format
      3.7.1.2 Accounting for Rare Categories
    3.7.2 Description of a Categorical Variable or a Subpopulation
      3.7.2.1 Description of a Categorical Variable by a Categorical Variable
      3.7.2.2 Description of a Subpopulation (or a Category) by a Quantitative Variable
      3.7.2.3 Description of a Subpopulation (or a Category) by the Categories of a Categorical Variable
    3.7.3 The Burt Table
    3.7.4 Missing Values
  3.8 Example: The Survey on the Perception of Genetically Modified Organisms
    3.8.1 Data Description — Issues
    3.8.2 Analysis Parameters and Implementation with FactoMineR
    3.8.3 Analysing the First Plane
    3.8.4 Projection of Supplementary Variables
    3.8.5 Conclusion
  3.9 Example: The Sorting Task Dataset
    3.9.1 Data Description — Issues
    3.9.2 Analysis Parameters
    3.9.3 Representation of Individuals on the First Plane
    3.9.4 Representation of Categories
    3.9.5 Representation of the Variables

4 Clustering
  4.1 Data — Issues
  4.2 Formalising the Notion of Similarity
    4.2.1 Similarity between Individuals
      4.2.1.1 Distances and Euclidean Distances
      4.2.1.2 Example of Non-Euclidean Distance
      4.2.1.3 Other Euclidean Distances
      4.2.1.4 Similarities and Dissimilarities
    4.2.2 Similarity between Groups of Individuals
  4.3 Constructing an Indexed Hierarchy
    4.3.1 Classic Agglomerative Algorithm
    4.3.2 Hierarchy and Partitions
  4.4 Ward's Method
    4.4.1 Partition Quality
    4.4.2 Agglomeration According to Inertia
    4.4.3 Two Properties of the Agglomeration Criterion
    4.4.4 Analysing Hierarchies, Choosing Partitions
  4.5 Direct Search for Partitions: K-Means Algorithm
    4.5.1 Data — Issues
    4.5.2 Principle
    4.5.3 Methodology
  4.6 Partitioning and Hierarchical Clustering
    4.6.1 Consolidating Partitions
    4.6.2 Mixed Algorithm
  4.7 Clustering and Principal Component Methods
    4.7.1 Principal Component Methods Prior to AHC
    4.7.2 Simultaneous Analysis of a Principal Component Map and Hierarchy
  4.8 Clustering and Missing Data
  4.9 Example: The Temperature Dataset
    4.9.1 Data Description — Issues
    4.9.2 Analysis Parameters
    4.9.3 Implementation of the Analysis
  4.10 Example: The Tea Dataset
    4.10.1 Data Description — Issues
    4.10.2 Constructing the AHC
    4.10.3 Defining the Clusters
  4.11 Dividing Quantitative Variables into Classes

5 Visualisation
  5.1 Data — Issues
  5.2 Viewing PCA Data
    5.2.1 Selecting a Subset of Objects — Cloud of Individuals
    5.2.2 Selecting a Subset of Objects — Cloud of Variables
    5.2.3 Adding Supplementary Information
  5.3 Viewing Data from a CA
    5.3.1 Selecting a Subset of Objects — Cloud of Rows or Columns
    5.3.2 Adding Supplementary Information
  5.4 Viewing MCA Data
    5.4.1 Selecting a Subset of Objects — Cloud of Individuals
    5.4.2 Selecting a Subset of Objects — Cloud of Categories
    5.4.3 Selecting a Subset of Objects — Clouds of Variables
    5.4.4 Adding Supplementary Information
  5.5 Alternatives to the Graphics Function in the FactoMineR Package
    5.5.1 The Factoshiny Package
    5.5.2 The factoextra Package
  5.6 Improving Graphs Using Arguments Common to Many FactoMineR Graphical Functions

Appendix
  A.1 Percentage of Inertia Explained by the First Component or by the First Plane
  A.2 R Software
    A.2.1 Introduction
    A.2.2 The Rcmdr Package
    A.2.3 The FactoMineR Package

Bibliography of Software Packages

Bibliography

Index
Preface

Qu’est-ce que l’analyse des données ? (English: What is data analysis?)


As it is usually understood in France, and within the context of this book,
the expression analyse des données reflects a set of statistical methods whose
main features are to be multidimensional and descriptive.
The term multidimensional itself covers two aspects. First, it implies
that observations (or, in other words, individuals) are described by several
variables. In this introduction we restrict ourselves to the most common
data, those in which a group of individuals is described by one set of variables.
But, beyond the fact that we have many values from many variables for each
observation, it is the desire to study them simultaneously that is characteristic
of a multidimensional approach. Thus, we will use those methods each time
the notion of profile is relevant when considering an individual, for example,
the response profile of consumers, the biometric profile of plants, the financial
profile of businesses, and so forth.
From another point of view, the interest of considering values of indi-
viduals for a set of variables in a global manner lies in the fact that these
variables are linked. Let us note that studying links between all the vari-
ables taken two-by-two does not constitute a multidimensional approach in
the strict sense. The multidimensional approach, in contrast, involves the simultaneous
consideration of all the links between variables taken two-by-two. That is what is done, for exam-
ple, when highlighting a synthetic variable: such a variable represents several
others, which implies that it is linked to each of them, which is only possible
if they are themselves linked two-by-two. The concept of synthetic variable is
intrinsically multidimensional and is a powerful tool for the description of an
individuals × variables table. In both respects, it is a key concept within the
context of this book.
One last comment about the term analyse des données since it can have at
least two meanings — the one defined previously and another broader one that
could be translated as “statistical investigation.” This second meaning is from
a user’s standpoint; it is defined by an objective (to analyse data) and says
nothing about the statistical methods to be used. This is what the English
term data analysis covers. The term data analysis, in the sense of a set of
descriptive multidimensional methods, is more of a French statistical point of
view. It was introduced in France in the 1960s by Jean-Paul Benzécri and the
adoption of this term is probably related to the fact that these multivariate
methods are at the heart of many “data analyses.”


To Whom Is This Book Addressed?


This book has been designed for scientists whose aim is not to become statis-
ticians but who feel the need to analyse data themselves. It is therefore
addressed to practitioners who are confronted with the analysis of data. From
this perspective it is application oriented; formalism and mathematics writing
have been reduced as much as possible while examples and intuition have been
emphasised. Specifically, an undergraduate level is quite sufficient to capture
all the concepts introduced.
On the software side, an introduction to the R language is sufficient, at
least at first. This software is free and available on the Internet at the following
address: http://www.r-project.org/.
Content and Spirit of the Book
This book focuses on four essential and basic methods of multivariate ex-
ploratory data analysis, those with the largest potential in terms of applica-
tions: principal component analysis (PCA) when variables are quantitative,
correspondence analysis (CA) and multiple correspondence analysis (MCA)
when variables are categorical and hierarchical cluster analysis. The geo-
metric point of view used to present all these methods constitutes a unique
framework in the sense that it provides a unified vision when exploring mul-
tivariate data tables. Within this framework, we will present the principles,
the indicators, and the ways of representing and visualising objects (rows and
columns of a data table) that are common to all those exploratory methods.
From this standpoint, adding supplementary information by simply projecting
vectors is commonplace. Thus, we will show how it is possible to use categor-
ical variables within a PCA context where variables that are to be analysed
are quantitative, to handle more than two categorical variables within a CA
context where originally there are two variables, and to add quantitative vari-
ables within an MCA context where variables are categorical. More than
the theoretical aspects and the specific indicators induced by our geometrical
viewpoint, we will illustrate the methods and the way they can be exploited
using examples from various fields, hence the name of the book.
Throughout the text, each result correlates with its R command. All these
commands are accessible from FactoMineR, an R package developed by the
authors. The reader will be able to conduct all the analyses of the book as
all the datasets (as well as all the lines of code) are available at the following
website address: http://factominer.free.fr/bookV2. We hope that with
this book, the reader will be fully equipped (theory, examples, software) to
confront multivariate real-life data.
Note on the Second Edition
There were two main reasons behind the second edition of this work. The
first was that we wanted to add a chapter on viewing and improving the graphs
produced by the FactoMineR software. The second was to add a section to
each chapter on managing missing data, which will enable users to conduct
analyses from incomplete tables more easily.
The authors would like to thank Rebecca Clayton for her help in the translation.
1
Principal Component Analysis (PCA)

1.1 Data — Notation — Examples


Principal component analysis (PCA) applies to data tables where rows are
considered as individuals and columns as quantitative variables. Let xik be
the value taken by individual i for variable k, where i varies from 1 to I and
k from 1 to K.
Let x̄k denote the mean of variable k calculated over the I individuals:

$$\bar{x}_k = \frac{1}{I} \sum_{i=1}^{I} x_{ik},$$

and sk the standard deviation of the sample of variable k (uncorrected):

$$s_k = \sqrt{\frac{1}{I} \sum_{i=1}^{I} (x_{ik} - \bar{x}_k)^2}.$$
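These two quantities are easy to check in R. The following is a minimal sketch, assuming a hypothetical numeric matrix X whose I rows are the individuals and whose K columns are the variables; note that R's built-in sd uses the corrected denominator I − 1, so the uncorrected standard deviation is computed explicitly:

> X <- matrix(rnorm(42), nrow = 6, ncol = 7)  # hypothetical 6 x 7 data matrix
> I <- nrow(X)
> xbar <- colMeans(X)                         # means of the K variables
> s <- sqrt(colMeans(sweep(X, 2, xbar)^2))    # uncorrected standard deviations
> apply(X, 2, sd) * sqrt((I - 1) / I)         # same values via the corrected sd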

Data subjected to a PCA can be very diverse in nature; some examples
are listed in Table 1.1.
This first chapter will be illustrated using the “orange juice” dataset chosen
for its simplicity since it comprises only six statistical individuals or observa-
tions. The six orange juices were evaluated by a panel of experts according
to seven sensory variables (odour intensity, odour typicality, pulp content, in-
tensity of taste, acidity, bitterness, sweetness). The panel’s evaluations are
summarised in Table 1.2.

1.2 Objectives
The data table can be considered either as a set of rows (individuals) or as a
set of columns (variables), thus raising a number of questions relating to these
different types of objects.
TABLE 1.1
Some Examples of Datasets

Field       Individuals     Variables                     xik
Ecology     Rivers          Concentration of pollutants  Concentration of pollutant k in river i
Economics   Years           Economic indicators          Indicator value k for year i
Genetics    Patients        Genes                        Expression of gene k for patient i
Marketing   Brands          Measures of satisfaction     Value of measure k for brand i
Pedology    Soils           Granulometric composition    Content of component k in soil i
Biology     Animals         Measurements                 Measure k for animal i
Sociology   Social classes  Time by activity             Time spent on activity k by individuals from social class i

TABLE 1.2
The Orange Juice Data

                 Odour      Odour       Pulp     Intensity  Acidity  Bitterness  Sweetness
                 intensity  typicality  content  of taste
Pampryl amb.     2.82       2.53        1.66     3.46       3.15     2.97        2.60
Tropicana amb.   2.76       2.82        1.91     3.23       2.55     2.08        3.32
Fruvita fr.      2.83       2.88        4.00     3.45       2.42     1.76        3.38
Joker amb.       2.76       2.59        1.66     3.37       3.05     2.56        2.80
Tropicana fr.    3.20       3.02        3.69     3.12       2.33     1.97        3.34
Pampryl fr.      3.07       2.73        3.34     3.54       3.31     2.63        2.90

1.2.1 Studying Individuals


Figure 1.1 illustrates the types of questions posed during the study of individ-
uals. This diagram represents three different situations where 40 individuals
are described in terms of two variables: j and k. In graph A, we can clearly
identify two distinct classes of individuals. Graph B illustrates a dimension of
variability which opposes extreme individuals, much like graph A, but which
also contains less extreme individuals. The cloud of individuals is therefore
long and thin. Graph C depicts a more uniform cloud (i.e., with no specific
structure).
Interpreting the data depicted in these examples is relatively straightfor-
ward as they are two dimensional. However, when individuals are described
by a large number of variables, we require a tool to explore the space in which
these individuals evolve. Studying individuals means identifying the similari-
ties between individuals from the point of view of all the variables. In other
words, to provide a typology of the individuals: which are the most similar
individuals (and the most dissimilar)? Are there groups of individuals which
are homogeneous in terms of their similarities? In addition, we should look
for common dimensions of variability which oppose extreme and intermediate
individuals.

FIGURE 1.1
Representation of 40 individuals described by two variables: j and k (panels A, B, and C).
In the example, two orange juices are considered similar if they were eval-
uated in the same way according to all the sensory descriptors. In such cases,
the two orange juices have the same main dimensions of variability and are
thus said to have the same sensory “profile.” More generally, we want to know
whether or not there are groups of orange juices with similar profiles, that is,
sensory dimensions which might oppose extreme juices with more intermediate
juices.

1.2.2 Studying Variables


Following the approach taken to study the individuals, might it also be possi-
ble to interpret the data from the variables? PCA focuses on the linear rela-
tionships between variables. More complex links also exist, such as quadratic,
logarithmic, or exponential relationships, but they are
not studied in PCA. This may seem restrictive, but in practice many relation-
ships can be considered linear, at least for an initial approximation.
Let us consider the example of the four variables (j, k, l, and m) in Fig-
ure 1.2. The clouds of points constructed by working from pairs of variables
show that variables j and k (graph A) as well as variables l and m (graph F)
are strongly correlated (positively for j and k and negatively for l and m).
However, the other graphs do not show any signs of relationships between
variables. The study of these variables also suggests that the four variables
are split into two groups of two variables, (j, k) and (l, m), and that, within
one group, the variables are strongly correlated, whereas between groups, the
variables are uncorrelated. In exactly the same way as for constructing groups
of individuals, creating groups of variables may be useful with a view to syn-
thesis. As for the individuals, we identify a continuum with groups of both
very unusual variables and intermediate variables, which are to some extent
linked to both groups. In the example, each group can be represented by one
single variable as the variables within each group are very strongly correlated.
We refer to these variables as synthetic variables.
FIGURE 1.2
Representation of the relationships between four variables: j, k, l, and m, taken two-by-two (panels A–F).

When confronted with a very small number of variables, it is possible to
draw conclusions from the clouds of points, or from the correlation matrix
which groups together all of the linear correlation coefficients r(j, k) between
the pairs of variables. However, when working with a great number of vari-
ables, the correlation matrix groups together a large quantity of correlation
coefficients (190 coefficients for K = 20 variables). It is therefore essential to
have a tool capable of summarising the main relationships between the vari-
ables in a visual manner. The aim of PCA is to draw conclusions from the
linear relationships between variables by detecting the principal dimensions
of variability. As you will see, these conclusions will be supplemented by the
definition of the synthetic variables offered by PCA. It will therefore be eas-
ier to describe the data using a few synthetic variables rather than all of the
original variables.
In the example of the orange juice data, the correlation matrix (see Ta-
ble 1.3) brings together the 21 correlation coefficients. It is possible to group
the strongly correlated variables into sets, but even for this reduced number
of variables, grouping them this way is tedious.

TABLE 1.3
Orange Juice Data: Correlation Matrix

                    Odour      Odour       Pulp     Intensity  Acidity  Bitterness  Sweetness
                    intensity  typicality  content  of taste
Odour intensity      1.00       0.58        0.66    −0.27      −0.15    −0.15        0.23
Odour typicality     0.58       1.00        0.77    −0.62      −0.84    −0.88        0.92
Pulp content         0.66       0.77        1.00    −0.02      −0.47    −0.64        0.63
Intensity of taste  −0.27      −0.62       −0.02     1.00       0.73     0.51       −0.57
Acidity             −0.15      −0.84       −0.47     0.73       1.00     0.91       −0.90
Bitterness          −0.15      −0.88       −0.64     0.51       0.91     1.00       −0.98
Sweetness            0.23       0.92        0.63    −0.57      −0.90    −0.98        1.00
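Such a matrix is obtained directly in R with the cor function; a brief sketch, assuming the panel's scores of Table 1.2 are stored in a data frame named orange (a hypothetical name):

> round(cor(orange), 2)   # linear correlation coefficients, as in Table 1.3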

1.2.3 Relationships between the Two Studies


The study of individuals and the study of variables are interdependent as
they are carried out on the same data table: studying them jointly can only
reinforce their respective interpretations.
If the study of individuals led to a distinction between groups of individ-
uals, it is then possible to list the individuals belonging to only one group.
However, for high numbers of individuals, it seems more pertinent to char-
acterise them directly by the variables at hand: for example, by specifying
that some orange juices are acidic and bitter whereas others have a high pulp
content.
Similarly, when there are groups of variables, it may not be easy to inter-
pret the relationships between many variables and we can make use of specific
individuals, that is, individuals who are extreme from the point of view of
these relationships. In this case, it must be possible to identify the individu-
als. For example, the link between acidity-bitterness can be illustrated by the
opposition between two extreme orange juices: Fresh Pampryl (orange juice
from Spain) versus Fresh Tropicana (orange juice from Florida).

1.3 Studying Individuals


1.3.1 The Cloud of Individuals
An individual is a row of the data table, that is, a set of K numerical values.
The individuals thus evolve within a space RK called “the individual’s space.”
If we endow this space with the usual Euclidean distance, the distance between
two individuals i and l is expressed as

$$d(i, l) = \sqrt{\sum_{k=1}^{K} (x_{ik} - x_{lk})^2}.$$
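These pairwise Euclidean distances correspond to R's dist function; a minimal sketch on a hypothetical matrix X whose rows are individuals:

> d <- dist(X, method = "euclidean")   # all pairwise distances d(i, l)
> as.matrix(d)[1, 2]                   # distance between individuals 1 and 2
> sqrt(sum((X[1, ] - X[2, ])^2))       # the same value, from the formula above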

If two individuals have similar values within the table of all K variables, they
are also close in the space RK . Thus, the study of the data table can be
conducted geometrically by studying the distances between individuals. We
are therefore interested in all of the individuals in RK , that is, the cloud
of individuals (denoted NI ). Analysing the distances between individuals is
therefore tantamount to studying the shape of the cloud of points. Figure 1.3
illustrates a cloud of points within a space RK for K = 3.

FIGURE 1.3
Flight of a flock of starlings illustrating a scatterplot in RK .

The shape of the cloud NI remains the same even when translated. The
data are also centred, which corresponds to considering xik − x̄k rather than
xik. Geometrically, this is tantamount to making the centre of mass of the
cloud GI (with coordinates x̄k for k = 1, ..., K) coincide with the origin of reference (see
Figure 1.4). Centring presents technical advantages and is always conducted
in PCA.
The operation of reduction (also referred to as standardising), which con-
sists of considering (xik − x̄k )/sk rather than xik , modifies the shape of the
cloud by harmonising its variability in all the directions of the original vectors
(i.e., the K variables). Geometrically, it means choosing standard deviation
sk as a unit of measurement in direction k. This operation is essential if the
variables are not expressed in the same units. Even when the units of
measurement do not differ, this operation is generally preferable as it attaches
the same importance to each variable. Therefore, we will assume this to be
the case from here on in. Standardised PCA occurs when the variables are
centred and reduced, and unstandardised PCA when the variables are only
centred. When not otherwise specified, it may be assumed that we are using
standardised PCA.

FIGURE 1.4
Scatterplot of the individuals in RK.
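Centring and reduction correspond to R's scale function; a minimal sketch (scale divides by the corrected standard deviation, but since this common factor affects all variables alike, it does not change the shape of the standardised cloud):

> Xc <- scale(X, center = TRUE, scale = FALSE)  # centred only: unstandardised PCA
> Xs <- scale(X, center = TRUE, scale = TRUE)   # centred and reduced: standardised PCA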
Comment: Weighting Individuals
So far we have assumed that all individuals have the same weight. This applies
to almost all applications and is always assumed to be the case. Nevertheless,
generalisation with unspecified weights poses no conceptual or practical prob-
lems (double weight is equivalent to two identical individuals) and most soft-
ware packages, including FactoMineR, envisage this possibility (FactoMineR
is a package dedicated to Factor Analysis and Data Mining with R; see Sec-
tion A.2.3 in the appendix). For example, it may be useful to assign a different
weight to each individual after having rectified a sample. In all cases, it is
convenient to consider that the sum of the weights is equal to 1. When all
individuals have the same weight, each is assigned a weight of 1/I.
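In FactoMineR, such weights are passed through the row.w argument of the PCA function; a brief sketch making the default uniform weighting explicit:

> library(FactoMineR)
> res <- PCA(X, row.w = rep(1, nrow(X)), graph = FALSE)  # uniform weights, i.e., 1/I each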

1.3.2 Fitting the Cloud of Individuals


1.3.2.1 Best Plane Representation of NI
The aim of PCA is to represent the cloud of points in a space with reduced
dimensions in an “optimal” manner, that is to say, by distorting the distances
between individuals as little as possible. Figure 1.5 gives two representations
of three different fruits. The viewpoints chosen for the images of the fruits on
the top line make them difficult to identify. On the second row, the fruits can
be more easily recognised. What is it which differentiates the views of each
fruit between the first and the second lines? In the pictures on the second line,
the distances are less distorted and the representations take up more space
on the image. The image is a projection of a three-dimensional object in a
two-dimensional space.

FIGURE 1.5
Two-dimensional representations of fruits: from left to right, an avocado, a
melon, and a banana; each row corresponds to a different representation.

For a representation to be successful, it must select an appropriate viewpoint.
More generally, PCA means searching for the best representational
space (of reduced dimension) thus enabling optimal visualisation of the shape
of a cloud with K dimensions. We often use a plane representation alone,
which can prove inadequate when dealing with particularly complex data.
To obtain this representation, the cloud NI is projected on a plane of RK
denoted P , chosen in such a manner as to minimise distortion of the cloud
of points. Plane P is selected so that the distances between the projected
points might be as close as possible to the distances between the initial points.
Since, in projection, distances can only decrease, we try to make the projected
distances as high as possible. By denoting Hi the projection of the individual
i on plane P, the problem consists of finding P with

$$\sum_{i=1}^{I} OH_i^2 \quad \text{maximum}.$$

The convention for notation uses mechanical terms: O is the centre of gravity,
OHi is a vector, and the criterion is the inertia of the projection of NI . The
criterion which consists of increasing the variance of the projected points to a
maximum is perfectly appropriate.
Remark
If the individuals are weighted with different weights pi, the maximised criterion
is $\sum_{i=1}^{I} p_i\, OH_i^2$.

In some rare cases, it might be interesting to search for the best axial
representation of cloud NI alone. This best axis is obtained in the same way:
find the component u1 for which $\sum_{i=1}^{I} OH_i^2$ is maximum (where Hi is the
projection of i on u1). It can be shown that plane P contains component u1 (the
“best” plane contains the “best” component): in this case, these representa-
tions are said to be nested. An illustration of this property is presented in
Figure 1.6. Planets, which are in a three-dimensional space, are traditionally
represented on a component. This component determines their positions as
well as possible in terms of their distances from one another (in terms of inertia
of the projected cloud). We can also represent planets on a plane according
to the same principle: to maximise the inertia of the projected scatterplot
(on the plane). This best plane representation also contains the best axial
representation.

FIGURE 1.6
The best axial representation is nested in the best plane representation of the solar system (18 February 2008).

We define plane P by two orthonormal vectors chosen as follows: vector u1,
which defines the best axis (and which is included in P), and vector u2 of
the plane P orthogonal to u1. Vector u2 corresponds to the vector which
expresses the greatest variability of NI once that which is expressed by u1 is
removed. In other words, the variability expressed by u2 is the greatest
remaining variability and is independent of that expressed by u1.

1.3.2.2 Sequence of Axes for Representing NI


More generally, let us look for nested subspaces of dimensions s = 1 to S
so that each subspace is of maximum inertia for the given dimension s. The
subspace of dimension s is obtained by maximising $\sum_{i=1}^{I} (OH_i)^2$ (where Hi
is the projection of i in the subspace of dimension s). As the subspaces
are nested, it is possible to choose vector us as the vector of the orthogonal
subspace for all of the vectors ut (with 1 ≤ t < s) which define the smaller
subspaces.
The first plane (defined by u1 , u2 ), i.e., the plane of best representation, is
often sufficient for visualising cloud NI . When S is greater than or equal to 3,
we may need to visualise cloud NI in the subspace of dimension S by using a
number of plane representations: the representation on (u1 , u2 ) but also that
on (u3 , u4 ) which is the most complementary to that on (u1 , u2 ). However, in
certain situations, we might choose to associate (u2 , u3 ), for example, in order
to highlight a particular phenomenon which appears on these two components.

1.3.2.3 How Are the Components Obtained?


Components in PCA are obtained through diagonalisation of the correlation
matrix which extracts the associated eigenvectors and eigenvalues. The eigen-
vectors correspond to vectors us which are each associated with the eigenvalues
of rank s (denoted λs ), as the eigenvalues are ranked in descending order. The
eigenvalue λs is interpreted as the inertia of cloud NI projected on the compo-
nent of rank s or, in other words, the “explained variance” for the component
of rank s.
If all of the eigenvectors are calculated (S = K), the PCA recreates a basis
for the space RK . In this sense, PCA can be seen as a change of basis in which
the first vectors of the new basis play an important role.

Remark
When variables are centred but not standardised, the matrix to be diago-
nalised is the variance–covariance matrix.
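The diagonalisation itself can be reproduced with base R's eigen function; a minimal sketch, assuming a numeric matrix X (for an unstandardised PCA, cov(X) would be diagonalised instead, as the remark above indicates):

> e <- eigen(cor(X))   # eigenvalues are returned in descending order
> e$values             # lambda_s: inertia of the cloud projected on each component
> e$vectors[, 1:2]     # eigenvectors u_1 and u_2 defining the best plane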

1.3.2.4 Example
The distance between two orange juices is calculated using their seven sensory
descriptors. We decided to standardise the data to attribute each descriptor
equal influence. Figure 1.7 is obtained from the first two components of the
PCA and corresponds to the best plane for representing the cloud of individu-
als in terms of projected inertia. The inertia projected on the plane is the sum
of two eigenvalues, that is, 86.82% (= 67.77% + 19.05%) of the total inertia
of the cloud of points.
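These results can be reproduced with the PCA function of the FactoMineR package (detailed in Section 1.7); a brief sketch, again assuming the scores of Table 1.2 are in a data frame named orange:

> library(FactoMineR)
> res.pca <- PCA(orange, scale.unit = TRUE, graph = FALSE)  # standardised PCA
> res.pca$eig                    # eigenvalues and percentages of inertia
> plot(res.pca, choix = "ind")   # plane representation of the individuals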
FIGURE 1.7
Orange juice data: plane representation of the scatterplot of individuals (Dim 1: 67.77%, Dim 2: 19.05%).

The first principal component, that is, the principal axis of variability
between the orange juices, separates the two orange juices Tropicana fr. and
Pampryl amb. According to the data in Table 1.2, we can see that these orange
juices are the most extreme in terms of the descriptors odour typicality and
bitterness: Tropicana fr. is the most typical and the least bitter while Pampryl
amb. is the least typical and the most bitter. The second component, that
is, the property that separates the orange juices most significantly once the
main principal component of variability has been removed, identifies Tropicana
amb., which is the least intense in terms of odour, and Pampryl fr., which is
among the most intense (see Table 1.2).
Reading this data is tedious when there are a high number of individuals
and variables. For practical purposes, we will facilitate the characterisation
of the principal components by using the variables more directly.

1.3.3 Representation of the Variables as an Aid for Interpreting the Cloud of Individuals
Let Fs denote the coordinate of the I individuals on component s and Fs (i)
its value for individual i. Vector Fs is also called the principal component of
rank s. Fs is of dimension I and thus can be considered as a variable. To
interpret the relative positions of the individuals on the component of rank s,
it may be interesting to calculate the correlation coefficient between vector Fs
and the initial variables. Thus, when the correlation coefficient between Fs
and a variable k is positive (or indeed negative), an individual with a positive
coordinate on component Fs will generally have a high (or low, respectively)
value (relative to the average) for variable k.
In the example, F1 is strongly positively correlated with the variables
odour typicality and sweetness and strongly negatively correlated with the
variables bitter and acidic (see Table 1.4). Thus Tropicana fr., which has the
highest coordinate on component 1, has high values for odour typicality and
sweetness and low values for the variables acidic and bitter. Similarly, we
can examine the correlations between F2 and the variables. It may be noted
that the correlations are generally lower (in absolute value) than those with
the first principal component. We will see that this is directly linked to the
percentage of inertia associated with F2 which is, by construction, lower than
that associated with F1 . The second component can be characterised by the
variables odour intensity and pulp content (see Table 1.4).

TABLE 1.4
Orange Juice Data: Correlation between Variables and First Two Components

                     F1      F2
Odour intensity      0.46    0.75
Odour typicality     0.99    0.13
Pulp content         0.72    0.62
Intensity of taste  −0.65    0.43
Acidity             −0.91    0.35
Bitterness          −0.93    0.19
Sweetness            0.95   −0.16
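With FactoMineR, these correlations are available in the object returned by PCA; a brief sketch (the dimdesc function gives an automatic description of the components):

> round(res.pca$var$cor[, 1:2], 2)  # correlations with F1 and F2, as in Table 1.4
> dimdesc(res.pca, axes = 1:2)      # variables sorted by link with each component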

To make these results easier to interpret, particularly in cases with a high
number of variables, it is possible to represent each variable on a graph, using
its correlation coefficients with F1 and F2 as coordinates (see Figure 1.8).
FIGURE 1.8
Orange juice data: visualisation of the correlation coefficients between variables and the principal components F1 and F2.

We can now interpret the joint representation of the cloud of individuals
with this representation of the variables.
Remark
A variable is always represented within a circle of radius 1 (circle represented
in Figure 1.8): indeed, it must be noted that F1 and F2 are orthogonal (in
the sense that their correlation coefficient is equal to 0) and that a variable
cannot be strongly related to two orthogonal components simultaneously. In
the following section we shall examine why the variable will always be found
within the circle of radius 1.

1.4 Studying Variables


1.4.1 The Cloud of Variables
Let us now consider the data table as a set of columns. A variable is one of the
columns in the table, that is, a set of I numerical values, which is represented
by a point of the vector space with I dimensions, denoted RI (and known as
the “variables’ space”). The vector connects the origin of RI to the point. All
of these vectors constitute the cloud of variables and this ensemble is denoted
NK (see Figure 1.9).

FIGURE 1.9
The scatterplot of the variables NK in RI. In the case of a standardised PCA, the variables k are located within a hypersphere of radius 1.

The scalar product between two variables k and l is expressed as

$$\sum_{i=1}^{I} x_{ik} \times x_{il} = \|k\| \times \|l\| \times \cos(\theta_{kl})$$

with ‖k‖ and ‖l‖ the norms of variables k and l, and θkl the angle produced by
the vectors representing variables k and l. Since the variables used here are
centred, the norm of a variable is equal to its standard deviation multiplied
by the square root of I, and the scalar product is expressed as follows:

$$\sum_{i=1}^{I} (x_{ik} - \bar{x}_k) \times (x_{il} - \bar{x}_l) = I \times s_k \times s_l \times \cos(\theta_{kl}).$$

On the right-hand side of the equation, we can identify covariance between
variables k and l.
Similarly, by dividing each term in the equation by the standard deviations
sk and sl of variables k and l, we obtain the following relationship:

$$r(k, l) = \cos(\theta_{kl}).$$

This property is essential in PCA as it provides a geometric interpretation of
the correlation. Therefore, in the same way as the representation of cloud NI
can be used to visualise the variability between individuals, a representation of
the cloud NK can be used to visualise all of the correlations (through the angles
between variables) or, in other words, the correlation matrix. To facilitate
visualisation of the angles between variables, the variables are represented by
vectors rather than points.
Generally speaking, the variables being centred and reduced (scaled to
unit variance) have a length with a value of 1 (hence the term “standardised
variable”). The vector extremities are therefore on the sphere of radius 1 (also
called “hypersphere” to highlight the fact that, in general, I > 3), as shown
in Figure 1.9.
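
This equality between correlation and cosine is easy to check numerically; the short sketch below uses two arbitrary simulated variables:

set.seed(1)
x <- rnorm(10)
y <- x + rnorm(10)
a <- x - mean(x)                    # centre both variables
b <- y - mean(y)
sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))   # cos(theta_kl)
cor(x, y)                           # identical, up to floating-point error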
Comment about the Centring
In R^K, when the variables are centred, the origin of the axes is translated onto the mean point. This property does not hold for N_K in R^I.

1.4.2 Fitting the Cloud of Variables


As is the case for the individuals, the cloud of variables N_K is situated in a space R^I with a great number of dimensions, and it is impossible to visualise the cloud in the overall space. The cloud of variables must therefore be fitted using the same strategy as for the cloud of individuals: we maximise the equivalent criterion \sum_{k=1}^{K} (OH_k)^2, with H_k the projection of variable k on the subspace with reduced dimensions. Here too, the subspaces are nested and we can identify a series of orthogonal axes v_s which define the subspaces for dimensions s = 1 to S. The vector v_s belongs to a given subspace and is orthogonal to the vectors v_t which make up the smaller subspaces. It can then be shown that the vector v_s maximises \sum_{k=1}^{K} (OH_k^s)^2, where H_k^s is the projection of variable k on v_s.
Remark
In the individuals' space R^K, centring the variables causes the origin of the axes to shift to the mean point: the maximised criterion is therefore interpreted as a variance; the projected points must be as dispersed as possible. In R^I, centring has a different effect, as the origin is not the same as the mean point. The projected points should be as far as possible from the origin (although not necessarily dispersed), even if that means being grouped together or merged. This indicates that the position of the cloud N_K with respect to the origin is important.

Vectors v_s (s = 1, ..., S) belong to the space R^I and can consequently be considered new variables. The correlation coefficient r(k, v_s) between variables k and v_s is equal to the cosine of the angle \theta_k^s between Ok and v_s when variable k is centred and scaled, and thus standardised. The plane representation constructed by (v_1, v_2) is therefore convenient, as the coordinates of a variable k correspond to the cosines of the angles \theta_k^1 and \theta_k^2, and thus to the correlation coefficients between variable k and v_1, and between variable k and v_2. In a plane representation such as this, we can therefore immediately visualise whether or not a variable k is related to a dimension of variability (see Figure 1.10).
By their very construction, the variables v_s maximise the criterion \sum_{k=1}^{K} (OH_k^s)^2. Since the projected length of a standardised variable k on v_s is equal to the cosine of the angle \theta_k^s, the maximised criterion can be written
\[ \sum_{k=1}^{K} \cos^2(\theta_k^s) = \sum_{k=1}^{K} r^2(k, v_s). \]

The above expression shows that v_s is the new variable most strongly correlated with the set of K initial variables (subject to orthogonality with the v_t already found). As a result, v_s can be said to be a synthetic variable. Here we encounter the second aspect of the study of the variables (see Section 1.2.2).
Remark
When a variable is not standardised, its length is equal to its standard deviation. In an unstandardised PCA, the criterion can therefore be expressed as follows:
\[ \sum_{k=1}^{K} (OH_k^s)^2 = \sum_{k=1}^{K} s_k^2 \, r^2(k, v_s). \]
This highlights the fact that, in the case of an unstandardised PCA, each variable k is assigned a weight equal to its variance s_k^2.

It can be shown that the axes of representation of N_K are in fact eigenvectors of the matrix of scalar products between individuals. In practice, this property is only used when the number of variables exceeds the number of individuals. We will see in the following that these eigenvectors are deduced from those of the correlation matrix.

[Figure 1.10 here: on the left, variables A, B, C, D in R^I with their projections H_A, H_B, H_C, H_D; on the right, those projections in the principal plane.]

FIGURE 1.10
Projection of the scatterplot of the variables on the main plane of variability. On the left: visualisation in the space R^I; on the right: visualisation of the projections in the principal plane.

The best plane representation of the cloud of variables corresponds exactly to the graph representing the variables obtained as an aid to interpreting the representation of individuals (see Figure 1.8). This remarkable property is not specific to the example but applies when carrying out any standardised PCA. This point will be developed further in the following section.

1.5 Relationships between the Two Representations N_I and N_K

So far we have looked for representations of N_I and N_K according to the same principle and from one single data table. It therefore seems natural for these two analyses (N_I in R^K and N_K in R^I) to be related.
The relationships between the two clouds N_I and N_K are brought together under the general term of "relations of duality." This term refers to the dual approach of one single data table, by considering either the rows or the columns. This approach is also defined by "transition relations" (calculating the coordinates in one space from those in the other). Where F_s(i) is the coordinate of individual i and G_s(k) the coordinate of variable k on the component of rank s, we obtain the following equations:
\[ F_s(i) = \frac{1}{\sqrt{\lambda_s}} \sum_{k=1}^{K} x_{ik} \, G_s(k), \]
\[ G_s(k) = \frac{1}{\sqrt{\lambda_s}} \, \frac{1}{I} \sum_{i=1}^{I} x_{ik} \, F_s(i). \]

This result is essential for interpreting the data, and it is what makes PCA such a rich and reliable exploratory tool. It may be expressed as follows: individuals are on the same side as the variables for which they take high values, and opposite the variables for which they take low values. It must be noted that the x_ik are centred and therefore carry both positive and negative values; this is why an individual can lie directly opposite a variable for which it takes low values. F_s is referred to as the principal component of rank s; λ_s is the variance of F_s, and its square root is the length of F_s in R^I; v_s is known as the standardised principal component.
The total inertias of both clouds are equal (and equal to K for a standardised PCA) and, furthermore, when decomposed component by component, they are identical. This property is remarkable: if S dimensions are enough to perfectly represent N_I, the same is true for N_K. In this case, two dimensions are sufficient. If not, why generate a third synthetic variable which would not differentiate the individuals at all?
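
These transition relations can be verified numerically. Below is a sketch under the book's conventions (individuals weighted by 1/I, standard deviations computed with the 1/I convention); the data frame orange and its first seven columns are assumptions, as before:

X <- as.matrix(orange[, 1:7])
I <- nrow(X)
X <- scale(X) * sqrt((I - 1) / I)   # centre; reduce with sd computed over I
C <- crossprod(X) / I               # correlation matrix (1/I) t(X) %*% X
e      <- eigen(C)
lambda <- e$values                  # the lambda_s
Fmat   <- X %*% e$vectors           # F_s(i): coordinates of the individuals
Gmat   <- sweep(e$vectors, 2, sqrt(lambda), "*")  # G_s(k) = sqrt(lambda_s) u_s
# First transition relation: F_s(i) = (1/sqrt(lambda_s)) sum_k x_ik G_s(k)
max(abs(Fmat[, 1] - X %*% Gmat[, 1] / sqrt(lambda[1])))               # ~ 0
# Second: G_s(k) = (1/sqrt(lambda_s)) (1/I) sum_i x_ik F_s(i)
max(abs(Gmat[, 1] - crossprod(X, Fmat[, 1]) / (I * sqrt(lambda[1])))) # ~ 0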

1.6 Interpreting the Data


1.6.1 Numerical Indicators
1.6.1.1 Percentage of Inertia Associated with a Component
The first indicators that we shall examine are the ratios between the projected inertias and the total inertia. For component s:
\[ \frac{\sum_{i=1}^{I} \frac{1}{I} (OH_i^s)^2}{\sum_{i=1}^{I} \frac{1}{I} (Oi)^2} = \frac{\sum_{k=1}^{K} (OH_k^s)^2}{\sum_{k=1}^{K} (Ok)^2} = \frac{\lambda_s}{\sum_{s=1}^{K} \lambda_s}. \]
In the most usual case, when the PCA is standardised, \sum_{s=1}^{K} \lambda_s = K. When multiplied by 100, this indicator gives the percentage of inertia (of N_I in R^K or of N_K in R^I) expressed by the component of rank s. This can be interpreted in two ways:
1. As a measure of the quality of the data representation; in the example, we say that the first component expresses 67.77% of the data variability (see Table 1.5). In a standardised PCA (where I > K), we often compare λ_s with 1, the value below which the component of rank s represents less variability than a single original variable and is therefore not worthy of interest.
2. As a measure of the relative importance of the components; in the example, we say that the first component expresses three times more variability than the second. It accounts for three times as many variables, although this formulation is truly precise only when each variable is perfectly correlated with a single component.
Because of the orthogonality of the axes (both in R^K and in R^I), these inertia percentages can be added together for several components; in the example, 86.82% of the data variability is represented by the first two components (67.77% + 19.05% = 86.82%).

TABLE 1.5
Orange Juice Data: Decomposition of Variability per Component

         Eigenvalue   Percentage of variance   Cumulative percentage
Comp 1      4.74             67.77                    67.77
Comp 2      1.33             19.05                    86.81
Comp 3      0.82             11.71                    98.53
Comp 4      0.08              1.20                    99.73
Comp 5      0.02              0.27                   100.00
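
With FactoMineR, these quantities are returned in the eig component of the result. A minimal sketch (the data frame orange is an assumption, as before):

library(FactoMineR)
res.pca <- PCA(orange[, 1:7], scale.unit = TRUE, graph = FALSE)
res.pca$eig        # eigenvalues, percentages and cumulative percentages
barplot(res.pca$eig[, "percentage of variance"])   # quick scree plot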

Let us return to Figure 1.5: the pictures of the fruits on the first line cor-
respond approximately to a projection of the fruits on the plane constructed
by components 2 and 3 of PCA, whereas the images on the second line cor-
respond to a projection on plane 1-2. This is why the fruits are easier to
recognise on the second line: the more variability (i.e., the more information)
collected on plane 1-2 when compared to plane 2-3, the easier it is to grasp
the overall shape of the cloud. Furthermore, the banana is easier to recognise
in plane 1-2 (the second line), as it retrieves greater inertia on plane 1-2. In
concrete terms, as the banana is a longer fruit than a melon, this leads to
more marked differences in inertia between the components. As the melon is
almost spherical, the percentages of inertia associated with each of the three
components are around 33% and therefore the inertia retrieved by plane 1-2
is nearly 66% (as is that retrieved by plane 2-3).

1.6.1.2 Quality of Representation of an Individual or Variable


The quality of representation of an individual i on the component s can be measured by the distance between the point within the space and its projection on the component. In practice, it is preferable to calculate the percentage of the inertia of individual i projected on the component s. Therefore, when \theta_i^s is the angle between Oi and u_s, we obtain
\[ \mathrm{qlt}_s(i) = \frac{\text{projected inertia of } i \text{ on } u_s}{\text{total inertia of } i} = \cos^2(\theta_i^s). \]
By Pythagoras' theorem, this indicator can be added over several components and is most often calculated for a plane.

The quality of representation of a variable k on the component of rank s is expressed as
\[ \mathrm{qlt}_s(k) = \frac{\text{projected inertia of } k \text{ on } v_s}{\text{total inertia of } k} = \cos^2(\theta_k^s). \]
This last quantity is equal to r^2(k, v_s), which is why the quality of representation of a variable is only very rarely provided by software: the representational quality of a variable in a given plane can be obtained directly on the graph by visually evaluating its distance from the circle of radius 1.
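
In FactoMineR these squared cosines are stored in the cos2 components of the result; a short sketch, assuming res.pca was obtained as above:

res.pca$ind$cos2                    # qlt_s(i) for each individual
res.pca$var$cos2                    # qlt_s(k) = r^2(k, v_s) for each variable
rowSums(res.pca$ind$cos2[, 1:2])    # quality on plane (1,2), by Pythagoras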

1.6.1.3 Detecting Outliers

Analysing the shape of the cloud N_I also means detecting unusual or remarkable individuals. An individual is considered remarkable if it has extreme values for multiple variables. In the cloud N_I, such an individual is far from the cloud's centre of gravity, and its remarkable nature can be evaluated from its distance to the centre of the cloud in the overall space R^K.
In the example, none of the orange juices is particularly extreme (see Table 1.6). The two most extreme individuals are Tropicana fresh and Pampryl ambient.
TABLE 1.6
Orange Juice Data: Distances from the Individuals to the Centre of the
Cloud
Pampryl amb. Tropicana amb. Fruvita fr. Joker amb. Tropicana fr. Pampryl fr.
3.03 1.98 2.59 2.09 3.51 2.34
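
These distances are provided directly by FactoMineR (assuming res.pca as above):

round(res.pca$ind$dist, 2)   # distances to the centre of gravity (Table 1.6)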

1.6.1.4 Contribution of an Individual or Variable to the Construction of a Component

Outliers have an influence on the analysis, and it is interesting to know to what extent their influence affects the construction of the components. Furthermore, some individuals can influence the construction of certain components without being remarkable themselves. Detecting those individuals that contribute to the construction of a principal component helps to evaluate the component's stability. It is also interesting to evaluate the contribution of the variables in constructing a component (especially in unstandardised PCA). To do so, we decompose the inertia of a component individual by individual (or variable by variable). The inertia explained by the individual i on the component s, expressed as a percentage, is
\[ \frac{(1/I)\,(OH_i^s)^2}{\lambda_s} \times 100. \]
Distances intervene in the components by their squares, which augments the roles of those individuals at a greater distance from the origin. Outlying individuals are the most extreme on the component, and their contributions are especially useful when the individuals' weights are different.
Remark
These contributions are combined for multiple individuals.

When an individual contributes significantly (i.e., much more than the others) to the construction of a principal component (for example, Tropicana ambient and Pampryl fresh for the second component; see Table 1.7), it is not uncommon for the results of a new PCA constructed without this individual to change substantially: the principal components can change and new oppositions between individuals may appear.

TABLE 1.7
Orange Juice Data: Contribution of
Individuals to the Construction of the
Components
Dim.1 Dim.2
Pampryl amb. 31.29 0.08
Tropicana amb. 2.76 36.77
Fruvita fr. 13.18 0.02
Joker amb. 12.63 8.69
Tropicana fr. 35.66 4.33
Pampryl fr. 4.48 50.10

Similarly, the contribution of variable k to the construction of component s can be calculated. An example of this is presented in Table 1.8.

TABLE 1.8
Orange Juice Data: Contribution of Variables to the
Construction of the Components
Dim.1 Dim.2
Odour intensity 4.45 42.69
Odour typicality 20.47 1.35
Pulp content 10.98 28.52
Taste intensity 8.90 13.80
Acidity 17.56 9.10
Bitterness 18.42 2.65
Sweetness 19.22 1.89
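
Both sets of contributions are available in the contrib components of the FactoMineR result (assuming res.pca as above):

round(res.pca$ind$contrib[, 1:2], 2)   # individuals (compare Table 1.7)
round(res.pca$var$contrib[, 1:2], 2)   # variables (compare Table 1.8)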

1.6.2 Supplementary Elements

Here we define the concepts of active and supplementary (or illustrative) elements. By definition, active elements contribute to the construction of the principal components, contrary to supplementary elements. Thus, the inertia of the cloud of individuals is calculated on the basis of the active individuals alone and, similarly, the inertia of the cloud of variables is calculated on the basis of the active variables alone. The supplementary elements make it possible to illustrate the principal components, which is why they are also referred to as "illustrative elements." Contrary to the active elements, which must be homogeneous, we can make use of as many illustrative elements as possible.

1.6.2.1 Representing Supplementary Quantitative Variables

By definition, a supplementary quantitative variable plays no role in calculating the distances between individuals. Such variables are represented in the same way as the active variables: to assist in interpreting the cloud of individuals (Section 1.3.3). The coordinate of the supplementary variable k' on the component s corresponds to the correlation coefficient between k' and the principal component s (i.e., the variable whose values are the coordinates of the individuals projected on the component of rank s). k' can therefore be represented on the same graph as the active variables.
More formally, the transition formula can be used to calculate the coordinate of the supplementary variable k' on the component of rank s:
\[ G_s(k') = \frac{1}{\sqrt{\lambda_s}} \, \frac{1}{I} \sum_{i \in \{\text{active}\}} x_{ik'} \, F_s(i) = r(k', F_s), \]
where {active} refers to the set of active individuals. This coordinate is calculated from the active individuals alone.
In the example, in addition to the sensory descriptors, there are also physicochemical variables at our disposal (see Table 1.9). However, our stance remains unchanged, namely, to describe the orange juices on the basis of their sensory profiles. The analysis can be enriched using the supplementary variables, since we can now link the sensory dimensions to the physicochemical variables.

TABLE 1.9
Orange Juice Data: Supplementary Variables

                Glucose  Fructose  Saccharose  Sweetening power   pH    Citric acid  Vitamin C
Pampryl amb.     25.32     27.36      36.45         89.95        3.59      0.84        43.44
Tropicana amb.   17.33     20.00      44.15         82.55        3.89      0.67        32.70
Fruvita fr.      23.65     25.65      52.12        102.22        3.85      0.69        37.00
Joker amb.       32.42     34.54      22.92         90.71        3.60      0.95        36.60
Tropicana fr.    22.70     25.32      45.80         94.87        3.82      0.71        39.50
Pampryl fr.      27.16     29.48      38.94         96.51        3.68      0.74        27.00
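
In FactoMineR, supplementary quantitative (and categorical) variables are declared directly in the call to PCA. A sketch, where the column indices 8:14 and 15:16 are assumptions about the layout of the orange data frame:

library(FactoMineR)
res.pca <- PCA(orange, quanti.sup = 8:14, quali.sup = 15:16, graph = FALSE)
res.pca$quanti.sup$coord     # correlations of the supplementary variables
plot(res.pca, choix = "var") # correlation circle as in Figure 1.11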

The correlation circle (Figure 1.11) represents both the active and supplementary variables. The main component of variability opposes the orange juices perceived as acidic, bitter, not very sweet and not very typical with the orange juices perceived as sweet, typical, not very acidic and not very bitter. The analysis of this sensory perception is reinforced by the variables pH and saccharose. Indeed, these two variables are positively correlated with the first component and lie on the side of the orange juices perceived as sweet and slightly acidic (a high pH indicates low acidity). One also finds the reaction known as "saccharose inversion" (or hydrolysis): saccharose breaks down into glucose and fructose in an acidic environment (the acidic orange juices thus contain more fructose and glucose, and less saccharose, than the average).
[Figure 1.11 here: correlation circle with the active sensory variables and the supplementary physicochemical variables; axes Dim 1 (67.77%) and Dim 2 (19.05%).]

FIGURE 1.11
Orange juice data: representation of the active and supplementary variables.

Remark
When using PCA to explore data prior to a multiple regression, it is advisable to choose the explanatory variables of the regression model as the active variables for the PCA, and to project the variable to be explained (the dependent variable) as a supplementary variable. This gives some idea of the relationships between the explanatory variables, and thus of the need to select among them. It also gives us an idea of the quality of the regression: if the dependent variable is well projected (close to the circle of radius 1), the model is likely to fit well.

1.6.2.2 Representing Supplementary Categorical Variables

In PCA, the active variables are quantitative by nature, but it is possible to use information resulting from categorical variables on a purely illustrative (= supplementary) basis, that is, without using it to calculate the distances between individuals.
Categorical variables cannot be represented in the same way as the supplementary quantitative variables, since it is impossible to calculate the correlation between a categorical variable and F_s. The information carried by a categorical variable lies within its categories. It is quite natural to represent a category at the barycentre of all the individuals possessing it. Thus, following projection on the plane defined by the principal components, these categories remain at the barycentre of the plane representations of those individuals. A category can thus be regarded as the mean individual obtained from the set of individuals who possess it, and this is the way in which it is represented on the graph of individuals.
The information resulting from a supplementary categorical variable can also be represented using a colour code: all of the individuals sharing the same category are coloured in the same way. This facilitates visualisation of the dispersion around the barycentres associated with specific categories.
In the example, we can introduce the variable way of preserving, which has two categories (ambient and fresh), as well as the variable origin of the fruit juice, which also has two categories (Florida and Other); see Table 1.10. It seems that sensory perception of the products differs according to their packaging (despite the fact that they were all tasted at the same temperature). The second bisectrix separates the products purchased in the chilled section of the supermarket from the others (see Figure 1.12).

TABLE 1.10
Orange Juice Data: Supplementary
Categorical Variables
Way of Origin
preserving
Pampryl amb. Ambient Other
Tropicana amb. Ambient Florida
Fruvita fr. Fresh Florida
Joker amb. Ambient Other
Tropicana fr. Fresh Florida
Pampryl fr. Fresh Other

[Figure 1.12 here: the six juices plotted on Dim 1 (67.77%) × Dim 2 (19.05%), with the category barycentres Fresh, Ambient, Florida and Other.]
FIGURE 1.12
Orange juice data: plane representation of the scatterplot of individuals with
a supplementary categorical variable.
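
A plot like Figure 1.12 can be obtained with the habillage argument of plot.PCA, which colours the individuals by a categorical variable; a sketch assuming res.pca and the column layout used above:

res.pca$quali.sup$coord                        # barycentres of the categories
plot(res.pca, choix = "ind", habillage = 15)   # colour by way of preserving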

1.6.2.3 Representing Supplementary Individuals

Just as for the variables, we can use the transition formula to calculate the coordinate of a supplementary individual i' on the component of rank s:
\[ F_s(i') = \frac{1}{\sqrt{\lambda_s}} \sum_{k=1}^{K} x_{i'k} \, G_s(k). \]

Note that centring and standardising (if any) are conducted with respect to the means and standard deviations calculated from the active individuals only. Moreover, the coordinate of i' is calculated from the active variables alone. It is therefore not necessary to know the values of the supplementary individuals for the supplementary variables.
Comment
A supplementary category can be regarded as a supplementary individual which, for each active variable, takes the average calculated from all of the individuals possessing that category.
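
Supplementary individuals are declared with the ind.sup argument of PCA. The sketch below treats the last row of the table as supplementary purely for illustration:

res.sup <- PCA(orange, ind.sup = 6, quanti.sup = 8:14, quali.sup = 15:16,
               graph = FALSE)
res.sup$ind.sup$coord    # F_s(i') computed from the active variables alone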

1.6.3 Automatic Description of the Components

The components provided by principal component methods can be described automatically by all of the variables, whether quantitative or categorical, supplementary or active.
For a quantitative variable, the principle is the same whether the variable is active or supplementary. First, the correlation coefficient between each variable and the coordinates of the individuals on the component s is calculated. The variables are then sorted in descending order of absolute correlation, and those with the highest correlation coefficients (in absolute value) are retained.
Comment
Let us recall that principal components, being synthetic variables, are linear combinations of the active variables. Testing the significance of the correlation coefficient between a component and an active variable is thus a procedure biased by its very construction. However, it is useful to sort and select the active variables in this manner to describe the components. For the supplementary variables, on the other hand, the test described corresponds to the one traditionally used to test the significance of the correlation coefficient between two variables.
For a categorical variable, we conduct a one-way analysis of variance in which we seek to explain the coordinates of the individuals (on the component of rank s) by the categorical variable; we use the sum-to-zero contrasts \sum_i \alpha_i = 0. Then, for each category, a Student's t-test is conducted to compare the average of the individuals possessing that category with the general average (we test \alpha_i = 0, considering that the variances of the coordinates are equal for each category). The coefficients are then sorted according to the p-values, in descending order for the positive coefficients and in ascending order for the negative coefficients.
These aids to interpretation are particularly useful for understanding those dimensions described by a high number of variables.
The dataset used here contains only a few variables. We shall nonetheless give the outputs of the automatic description procedure for the first component as an example. The variables which best characterise component 1 are odour typicality, sweetness, bitterness, and acidity (see Table 1.11).

TABLE 1.11
Orange Juice Data: Description of the First Dimension by the Quantitative Variables

                   Correlation   p-value
Odour typicality      0.9854      0.0003
Sweetness             0.9549      0.0030
pH                    0.8797      0.0208
Acidity              −0.9127      0.0111
Bitterness           −0.9348      0.0062

The first component is also characterised by the categorical variable Origin, whose R² is significantly different from 0 (p-value = 0.0094; see the object quali in Table 1.12): the coordinates of the orange juices from Florida are significantly higher than average on the first component, whereas the coordinates of the other orange juices are lower than average (see the object category in Table 1.12).

TABLE 1.12
Orange Juice Data: Description of the First
Dimension by the Categorical Variables and
the Categories of These Categorical Variables
$Dim.1$quali
R2 p-value
Origin 0.8458 0.0094

$Dim.1$category
Estimate p-value
Florida 2.0031 0.0094
Other −2.0031 0.0094
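
This automatic description is implemented in FactoMineR by the dimdesc function; a sketch, assuming res.pca as above:

res.dim <- dimdesc(res.pca, axes = 1:2, proba = 0.05)
res.dim$Dim.1$quanti     # quantitative variables (compare Table 1.11)
res.dim$Dim.1$quali      # categorical variables (compare Table 1.12)
res.dim$Dim.1$category   # categories of the categorical variables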

1.7 Implementation with FactoMineR

In this section, we explain how to carry out a PCA with FactoMineR and how to obtain the results presented above for the orange juice data. First, load the FactoMineR package.
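
A minimal sketch of the full analysis follows; the import path, file format and column indices are assumptions (the dataset is available from the book's website):

install.packages("FactoMineR")                  # once
library(FactoMineR)
orange <- read.table("orange.csv", header = TRUE, sep = ";",
                     row.names = 1)             # hypothetical file name
res.pca <- PCA(orange, scale.unit = TRUE, quanti.sup = 8:14,
               quali.sup = 15:16)
summary(res.pca)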