Use of Principal Component Analysis (PCA) and Hierarchical Cluster Analysis 2017
Keywords: Chemometrics; Principal component analysis; Cluster analysis; Correlation analysis; Bioactive compounds; Functional properties

Background: The development of statistical software has enabled food scientists to perform a wide variety of mathematical/statistical analyses and solve problems. Therefore, the use of not only sophisticated analytical methods but also multivariate statistical methods has increased considerably. Herein, principal component analysis (PCA) and hierarchical cluster analysis (HCA) are the most widely used tools to explore similarities and hidden patterns among samples where relationships in the data and groupings are still unclear. Usually, larger chemical data sets, bioactive compounds and functional properties are the target of these methodologies.

Scope and approach: In this article, we criticize the use of these methods in cases where correlation analysis should be calculated and its results analyzed.

Key findings and conclusions: The use of PCA and HCA in food chemistry studies has increased because the results are easy to interpret and discuss. However, their indiscriminate use to assess the association between bioactive compounds and in vitro functional properties is criticized as they provide only a qualitative view of the data. When appropriate, one should bear in mind that the correlation between the content of chemical compounds and bioactivity could be duly discussed using correlation coefficients.
1. Introduction

As well stressed by Ropodi, Panagou, and Nychas (2016), in the 21st century, governmental, industrial, and academic problems need to be addressed by using sophisticated analytical tools with proper data collection, analysis and interpretation. In this sense, data mining and data analysis are two interrelated approaches developed rapidly to address problems related to engineering and technology, as well as medicine, economics, biology, and food science (Brown, 2017).

Chemometrics is an interfacial discipline that extracts useful information from large chemical and biochemical data sets using different mathematical and statistical methods (Brown, 2017; Nunes, Alvarenga, Sant'Ana, Santos, & Granato, 2015). In applied chemistry, the use of chemometrics has been widespread and well recognized since 1960 (Brereton, 2014), but in food sciences and technology the applications of chemometrics and sensometrics (multivariate methods applied to sensory data and consumer studies) are somewhat new (Aquino et al., 2014; Munck, Nørgaard, Engelsen, Bro, & Andersson, 1998; Qannari, 2017). Conversely, the application of chemometrics for assessing the adulteration and geographical origin of foods based on chemical markers is well established in food science (Granato, Koot, Schnitzler, & van Ruth, 2015; Granato, Margraf, Brotzakis, Capuano, & van Ruth, 2015; Paneque, Morales, Burgos, Ponce, & Callejón, 2017; Giannetti, Mariani, Mannino, & Marini, 2017; Opatić et al., 2018). For example, Garrido-Delgado, Muñoz-Pérez, and Arce (2018) used ion mobility spectrometry (IMS) to determine the origin and quality of olive oil and its adulteration with low-cost vegetable oils. Using different statistical tools, the authors were able to predict the level of contaminating oil in olive oil. Therefore, there is no doubt that chemometric tools are of fundamental importance to solve real life problems.

Granato, Nunes, and Barba (2017) stated that the use of design of experiments together with appropriate statistical data analysis is of pivotal importance to assess the association between nutrition, biology, pharmacology, functional properties and the chemical components of foods and their extracts. In this sense, chemometric tools and other statistical methodologies may be of interest when different food extracts and bioactivities need to be evaluated (Granato, de Araújo Calado, & Jarvis, 2014).

In real life applications, chemometrics may be employed in food science and technology studies either to assess similarities/differences between multiple objects (samples) or to project the objects in a two/three-dimensional factor-plane based on various characteristics. Therefore, clusterings can be observed and the reasons for the grouping can be pinpointed (Erasmus, Muller, Butler, & Hoffman, 2018; Jandrić & Cannavan, 2017; Lund, Brown, & Shipley, 2017). Additionally, multivariate techniques have been widely used to authenticate/trace the geographical origin of foods, to verify the farming system employed by a company and check whether it complies with the information declared on the label, and to check for adulterations (intentional or not) of foods and raw materials (Granato, Koot, & van Ruth, 2015; Chiesa et al., 2016; Müller-Maatsch, Schweiggert, & Carle, 2016; Tavares et al., 2016; Zhu, Wang, & Chen, 2017; Karabagias et al., 2017; Chung et al., 2017; Giannetti et al., 2017; Acierno, Alewijn, Zomer, & van Ruth,

For example, Luo, Shi, and Feng (2017) aimed to characterize the metabolites of Zhi-Zi-Hou-Po decoction, a traditional Chinese medicine, in rat bile, urine and feces after oral administration, using untargeted liquid chromatography time of flight mass spectrometry combined with orthogonal partial least squares discriminant analysis (OPLS-DA). After analyzing the experimental data, the authors were able to identify 83 compounds, of which 39 were metabolites, in the biological samples. In addition, the metabolic pathway (glucuronidation) by which these metabolites formed after oral administration of the decoction was identified by using OPLS-DA. This research is an example of how chemometric tools are important aids not only in the food chemistry field but also in experimental nutrition studies.

According to Brereton (2015), chemometrics users tend to ‘follow the crowd’ and use the available software indiscriminately without knowing the principles and fundamentals of each method applied in their research data analysis. In food chemistry studies, Principal Component Analysis (PCA) and Hierarchical Cluster Analysis (HCA) are widely (and, sometimes, improperly) applied as “unsupervised classification” methods to assess the association between bioactive compounds and in vitro functional properties (e.g., antioxidant activity and inhibition of enzymes). Herein, a critical perspective on these display techniques (PCA and HCA) is given together with some comments on their use in the field of bioactive compounds.

2. Study of bioactive compounds and in vitro potential functional properties with the use of chemometrics

Chemometrics may be used for both qualitative and quantitative
Table 1
Factor loadings for illustrating the interpretation of Fig. 2.

Factor                       PC1      PC2      PC3      PC4
DPPH                         0.69    −0.47     0.16    −0.42
ABTS                         0.68     0.06    −0.40     0.44
FRAP                         0.63    −0.65    −0.12    −0.02
Gallic acid                  0.50    −0.66     0.09    −0.30
Caffeic acid                 0.81    −0.23     0.22     0.15
5-O-caffeoylquinic acid      0.04    −0.70    −0.50     0.27
(+)-Epicatechin             −0.75    −0.54    −0.30    −0.09
(+)-Catechin                −0.90    −0.07     0.08    −0.17
Quercetin                   −0.90    −0.19     0.03    −0.08
Quercetrin                  −0.52    −0.26     0.70    −0.36
Luteolin                    −0.78    −0.05    −0.65     0.09
Ellagic acid                 0.13     0.48    −0.46    −0.73
Eigenvalue                   5.39     3.78     0.56     0.23
Explained variance (%)      50.35    30.56     8.05     3.18

Note: bold numbers are factor loadings higher than 0.60.

ferric reducing antioxidant power - FRAP). Similarly, PC2 explained another 30% of variability in the original responses and separates the juices based on FRAP, gallic acid, and 5-O-caffeoylquinic acid. PC3 and PC4 explain only 11% of data variance and barely differentiate the juice samples. The factor loadings from PC3 and PC4 were very low (except for quercetrin/luteolin and ellagic acid, respectively). Factor loadings lower than 0.60 (in absolute value) indicate variables that do not fit well with the factor solution and should possibly be dropped from the analysis, especially if the projection of samples on a factor-plane is based on a 2-dimensional graph. As a final comment, the first two PCs explain about 81% of data variance, which leaves about 19% of the variation unexplained.

Once the representative PCs have been found, on the basis of sample differentiation/grouping and variance explained, loading analysis is started in order to find the underlying relationships in the original data structure. In this step, the loadings can be visualized as a regression vector (a vector of correlation coefficients between the original variables and each PC score). Positive factor loadings indicate that the factor will be higher in the positive axis of that PC. For example, for DPPH, a factor loading of 0.69 was obtained with PC1, which means that the samples located in the right-hand side (i.e., violet stars) of the graph have higher mean DPPH values than the samples located in the left-hand side (i.e., red stars). Similarly, negative factor loadings indicate that the factor will be higher in the negative axis of that PC. For example, for (−)-epicatechin a factor loading of −0.75 was obtained for PC1, meaning that the samples located in the right-hand side (i.e., violet stars) of the graph have lower mean concentrations than the samples located in the left-hand side (i.e., red stars).

As a complementary analysis and illustrative example, PCA data may be compared to correlation coefficients (Table 2). As shown, the antioxidant activity measured by three different assays (i.e., ABTS, FRAP, and DPPH) is mainly correlated (p < .05) to caffeic acid, (−)-epicatechin, (+)-catechin, quercetin, and luteolin. FRAP also correlated significantly with gallic acid and 5-O-caffeoylquinic acid. In this sense, if the main objective is to check for association between bioactive compounds and functional properties, correlation analysis should be carried out. For instance, Pearson's correlation coefficients and Spearman's rank correlation coefficients are the choices for normally distributed data and for data that do not conform to the normal distribution, respectively (de Oliveira et al., 2015).

As a final comment on this topic, there is no scientific need to perform PCA or HCA for data sets that lead to a conclusion similar to the one shown in the above-mentioned example. However, if the number of responses and samples is quite large and data are quite complex (e.g., NMR spectra), PCA is highly indicated.

dos Santos et al. (2017) quantified 13 phenolic compounds in 96 guava fruit pulps (Psidium guajava L.) by HPLC, including (+)-catechin, gallic, ferulic, trans-cinnamic, chlorogenic, caffeic, p-coumaric, syringic, vanillic, and ellagic acid, rutin, quercetin, and kaempferol. The extraction procedure was optimized using different concentrations of ethyl alcohol and methyl alcohol for 15–90 min using a sample to solvent ratio between 1:30 and 1:100 w/v. The extracts were also analyzed for total phenolic content, ascorbic acid, and flavonoids, together with the antioxidant activity toward DPPH and ABTS radicals. PCA was able to explain only 60% of data variability with 2 PCs, but a clear separation between ripe and green guava fruits was observed from the scatter plot. The main responses that separated the groups were syringic acid, (+)-catechin, p-coumaric acid, caffeic acid, ellagic acid, trans-cinnamic acid and rutin for the green guava, while for ripe and white guava, the better markers were gallic acid and chlorogenic acid. As a rational subsequent step, the authors applied ANN (a supervised algorithm) to the same data set to obtain a reliable methodology to classify their samples. ANN showed a suitable separation not only between green and white varieties but also between ripe and unripe guava fruits. It should be stressed that, as the data were successfully analyzed by PCA, a linear algorithm such as LDA or PLS-DA would have been the logical method to try.

However, in some cases, the differentiation between classes is not so clear (Fig. 3A) and outliers (one or more observation point(s) that is/are unusually distant from the other observations) can be detected in the dataset. In this case, the researcher cannot expect a straightforward separation between classes. Almost perfect segregation was obtained when all samples were analyzed after outlier removal (in synthetic data) using only two principal components (PCs), as shown in Fig. 3B.

Fidelis et al. (2017) evaluated multiple juices from different botanical origins (fruits and other vegetables) in relation to some classes of phenolics/bioactive compounds (tannins, total phenols, flavonoids, ortho-diphenols, flavonols, total anthocyanins, and betalains), physicochemical properties (pH, soluble solids, and acidity), and antioxidant effects (Fe2+ chelating properties, antiradical effect (DPPH, ABTS, and FRAP)), Folin-Ciocalteu's reducing capacity, and total reducing capacity. A total of 570 data points (38 juices and 15 responses) were analyzed for patterns using PCA, which explained 72% of data variability with 2 PCs, and it was possible to pinpoint the juices with higher bioactive compounds and antioxidant activity. PLS-DA was used to discriminate juice groups and the authors were able to separate Citrus juices from Super juices (made with berries) with correct classification rates above 73%, while data-driven SIMCA (DD-SIMCA), which is a one-class classification method, was able to discriminate the juice samples with accuracy higher than 86%. In this research, the authors concluded that the use of DD-SIMCA may be of interest when the authentication of juices based on phenolic compounds and antioxidant activity needs to be performed, especially in quality control programs in the juice industry.

Kalaycıoğlu, Kaygusuz, Döker, Kolaylı, and Erim (2017) used PCA to explore only n = 10 Turkish honeybee pollens from distinct origins

Table 2
Illustrative correlation coefficients to help in the interpretation of the example shown in Fig. 2.

Responses                    DPPH      ABTS      FRAP
DPPH                         1
ABTS                         0.899     1
FRAP                         0.946     0.947     1
Gallic acid                  0.564*    0.529*    0.608
Caffeic acid                 0.895     0.911     0.935
5-O-caffeoylquinic acid      0.523*    0.518*    0.622
(+)-Epicatechin              0.875     0.812     0.804
(+)-Catechin                 0.926     0.874     0.935
Quercetin                    0.873     0.924     0.901
Quercetrin                   0.425*    0.378*    0.333*
Luteolin                     0.788     0.829     0.845
Ellagic acid                 0.238*    0.356*    0.458*

Note: * denotes p > .05 while the other correlation coefficients present p < .05.
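The reading of Table 1 (which variables load on which PC, and how much variance the retained PCs cover) can be sketched in a few lines of Python. This is only an illustrative sketch: the loadings and explained variances are transcribed from Table 1, while the function names (salient_variables, cumulative_variance) are hypothetical, and the 0.60 cutoff is assumed to apply to the absolute value of the loading, which is how the text appears to use it.

```python
# Factor loadings transcribed from Table 1 (order: PC1, PC2, PC3, PC4).
LOADINGS = {
    "DPPH": (0.69, -0.47, 0.16, -0.42),
    "ABTS": (0.68, 0.06, -0.40, 0.44),
    "FRAP": (0.63, -0.65, -0.12, -0.02),
    "Gallic acid": (0.50, -0.66, 0.09, -0.30),
    "Caffeic acid": (0.81, -0.23, 0.22, 0.15),
    "5-O-caffeoylquinic acid": (0.04, -0.70, -0.50, 0.27),
    "(+)-Epicatechin": (-0.75, -0.54, -0.30, -0.09),
    "(+)-Catechin": (-0.90, -0.07, 0.08, -0.17),
    "Quercetin": (-0.90, -0.19, 0.03, -0.08),
    "Quercetrin": (-0.52, -0.26, 0.70, -0.36),
    "Luteolin": (-0.78, -0.05, -0.65, 0.09),
    "Ellagic acid": (0.13, 0.48, -0.46, -0.73),
}

# Explained variance (%) per PC, also from Table 1.
EXPLAINED_VARIANCE = (50.35, 30.56, 8.05, 3.18)

# Assumption: the 0.60 cutoff is applied to the absolute loading.
THRESHOLD = 0.60


def salient_variables(loadings, pc_index, threshold=THRESHOLD):
    """Names of variables whose absolute loading on one PC exceeds the cutoff."""
    return [name for name, row in loadings.items()
            if abs(row[pc_index]) > threshold]


def cumulative_variance(explained):
    """Running total of explained variance (%), rounded to 2 decimals."""
    total = 0.0
    running = []
    for value in explained:
        total += value
        running.append(round(total, 2))
    return running


if __name__ == "__main__":
    for index in range(4):
        print(f"PC{index + 1}:", salient_variables(LOADINGS, index))
    print("Cumulative variance (%):", cumulative_variance(EXPLAINED_VARIANCE))
```

Under these assumptions the sketch flags FRAP, gallic acid and 5-O-caffeoylquinic acid on PC2, and the first two PCs together cover 80.91% of the variance, in line with the roughly 81% quoted in the text.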
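As a minimal sketch of the correlation analysis recommended in the text, the following Python code computes Pearson's coefficient and Spearman's rank coefficient (Pearson's coefficient applied to average ranks) using only the standard library. The data values and variable names are hypothetical and are not taken from the article; the p-values reported in Table 2 are not reproduced here, since they would require a statistics package (for instance scipy.stats.pearsonr and scipy.stats.spearmanr).

```python
from math import sqrt


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rank correlation: Pearson's coefficient on the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical illustration only: a compound concentration and an
# antioxidant response that rises monotonically but not linearly with it.
compound = [1.0, 2.0, 3.0, 4.0, 5.0]
response = [1.0, 4.0, 9.0, 16.0, 25.0]

if __name__ == "__main__":
    print("Pearson: ", round(pearson(compound, response), 3))
    print("Spearman:", round(spearman(compound, response), 3))
```

In this hypothetical case Spearman's coefficient equals 1 because the relationship is perfectly monotonic, while Pearson's coefficient is slightly lower because the relationship is not linear. This reflects the distinction made in the text between the two coefficients, although the choice between them should rest on a check of the data distribution.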
2.3. Overall comments on PCA and HCA

Both PCA and HCA are usually used concomitantly in studies covering bioactive compounds and functional properties. To illustrate what is widely seen in published articles, consider the following: n = 20 samples coming from two fruits (A and B) are analyzed for the concentrations of total phenolics, carotenoids, antioxidant activity measured by the oxygen radical absorbance capacity (ORAC) assay, and inhibition of amylase and lipase. Results were analyzed using PCA and the 2D projection is given in Fig. 5A: it is possible to see a defined cluster containing fruit “B” and another group containing most “A” fruits. However, there are n = 3 “A” samples that are far from the main “A” group. One could say they are outliers simply by looking at the projection, but this cannot be done as PCA does not “classify” objects. In Fig. 5B, HCA was applied using the Ward's method as the amalgamation

especially if a large data set is analyzed. However, the indiscriminate use of multivariate exploratory statistical techniques (PCA and HCA) to assess the association between bioactive compounds and in vitro functional properties is criticized as the results will be, in most cases, a sine qua non observation. When appropriate, the researcher should bear in mind that the correlation between the content of chemical compounds and bioactivity could be duly discussed using simple correlation coefficients.

Acknowledgements

Daniel Granato acknowledges CNPq for a productivity grant (process 303188/2016-2). J. S. Santos and G. B. Escher thank CAPES/Fundação Araucária for their Ph.D scholarships.
D. Granato et al. Trends in Food Science & Technology 72 (2018) 83–90
