
Analisis Peubah Ganda (Multivariate Analysis)

Pertemuan VIII (Meeting VIII)
PRINCIPAL COMPONENT ANALYSIS
Let's look at an example: air pollution in US cities
Data were collected to investigate the determinants of
pollution. The following variables were obtained for 41 US
cities:
• SO2: SO2 content of air in micrograms per cubic metre;
• temp: average annual temperature in degrees Fahrenheit;
• manu: number of manufacturing enterprises employing 20
or more workers;
• popul: population size (1970 census) in thousands;
• wind: average annual wind speed in miles per hour;
• precip: average annual precipitation in inches;
• predays: average number of days with precipitation per
year.
The data are shown in the next table.
• The researchers' main goal in collecting the air pollution data was to determine which of the climate and human ecology variables are the best predictors of the degree of air pollution in a city, as measured by the sulphur dioxide content of the air.

• This question would normally be addressed by multiple linear regression, but there is a potential problem with applying this technique to the air pollution data, and that is the very high correlation between the manu and popul variables.
• Very high correlation between the manu and popul
variables can be seen from correlation matrix of the data

> data_awal<-read.csv("air_pollution.csv", row.names=1)
> round(cor(data_awal),digits=3)
SO2 temp manu popul wind precip predays
SO2 1.000 -0.434 0.645 0.494 0.095 0.054 0.370
temp -0.434 1.000 -0.190 -0.063 -0.350 0.386 -0.430
manu 0.645 -0.190 1.000 0.955 0.238 -0.032 0.132
popul 0.494 -0.063 0.955 1.000 0.213 -0.026 0.042
wind 0.095 -0.350 0.238 0.213 1.000 -0.013 0.164
precip 0.054 0.386 -0.032 -0.026 -0.013 1.000 0.496
predays 0.370 -0.430 0.132 0.042 0.164 0.496 1.000
So what???

We might, of course, deal with this problem by simply dropping either manu or popul, but here we will consider a possible alternative approach, and that is regressing the SO2 levels on the principal components derived from the six other variables in the data.
First, What is Principal Components?
• The principal components are linear combinations of the
original variables, which are uncorrelated and are ordered
so that the first few of them account for most of the
variation in all the original variables.
• A principal component analysis is concerned with
explaining the variance-covariance structure of a set of
variables through a few linear combinations of these
variables.
• Its general objectives are:
A) data reduction and
B) interpretation.
Essentially, principal component analysis is a one-sample technique applied to data with no groupings among the observations and no partitioning of the variables into subsets y and x.
More on PCA
• Although p components are required to reproduce the total
system variability, often much of this variability can be
accounted for by a small number k of the principal
components.

• The first principal component is the linear combination with maximal variance; we are essentially searching for a dimension along which the observations are maximally separated or spread out.

• The second principal component is the linear combination with maximal variance in a direction orthogonal to the first principal component, and so on.

Each principal component is formed by taking the values of the elements of the eigenvectors as the weights of the linear combination Yi.
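Below is a minimal sketch (not the lecture's own code) of this idea: the loadings are the eigenvectors of the correlation matrix and the component scores are the corresponding linear combinations. The data set USArrests is used here only because it ships with R.

# Eigen-decomposition view of PCA (illustrative sketch)
X <- scale(USArrests)            # standardize the variables
e <- eigen(cor(USArrests))       # eigenvalues/eigenvectors of the correlation matrix
e$values                         # component variances (eigenvalues)
e$vectors                        # columns = weights of each principal component
scores <- X %*% e$vectors        # the principal component scores Y_i
round(cor(scores), 3)            # the components are uncorrelated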
(Figure: the principal components viewed as a rotation of the original coordinate axes.)
• In some applications, the principal components are an end
in themselves and may be amenable to interpretation.

• More often they are obtained for use as input to another analysis. For example, two situations in regression where principal components may be useful are
1) if the number of independent variables is large relative to the number of observations, a test may be ineffective or even impossible, and
2) if the independent variables are highly correlated, the
estimates of regression coefficients may be unstable.
Now, we shall first examine how principal
components analysis can be used to explore
various aspects of the data with data example
above, and will then look at how such an analysis
can also be used to address the determinants of
pollution question.
To begin we shall ignore the SO2 variable and concentrate on the others, two of which relate to human ecology (popul, manu) and four to climate (temp, wind, precip, predays).

> keeps<-c(2,3,4,5,6,7)
> data_polusi<-data_awal[,keeps]

Since all six variables are such that high values represent a less attractive environment, a case can be made for using negative temperature values.

> data_polusi$negtemp<-(-1)*data_polusi$temp
> data_polusi$temp<-NULL
Prior to undertaking the principal components analysis on
the air pollution data, we will construct a scatterplot matrix
of the six variables, including the histograms for each
variable on the main diagonal.
> panel.hist <- function(x, ...) {
+ usr <- par("usr"); on.exit(par(usr))
+ par(usr = c(usr[1:2], 0, 1.5) )
+ h <- hist(x, plot = FALSE)
+ breaks <- h$breaks; nB <- length(breaks)
+ y <- h$counts; y <- y/max(y)
+ rect(breaks[-nB], 0, breaks[-1], y, col="grey", ...)
+ }
> pairs(data_polusi, diag.panel = panel.hist,
+ pch = ".", cex = 1.5)
• A clear message from the figure above is that there is at least one city, and probably more than one, that should be considered an outlier.

• But for the moment we shall carry on with a principal components analysis of the data for all 41 cities.
• From the data, it seems necessary to extract the principal components from the correlation rather than the covariance matrix, since the six variables to be used are on very different scales.

• Correlation matrix of the data:
> round(cor(data_polusi),digits=3)
manu popul wind precip predays negtemp
manu 1.000 0.955 0.238 -0.032 0.132 0.190
popul 0.955 1.000 0.213 -0.026 0.042 0.063
wind 0.238 0.213 1.000 -0.013 0.164 0.350
precip -0.032 -0.026 -0.013 1.000 0.496 -0.386
predays 0.132 0.042 0.164 0.496 1.000 0.430
negtemp 0.190 0.063 0.350 -0.386 0.430 1.000
R syntax for principal component analysis:
> polusi_pca <- princomp(data_polusi, cor = TRUE)
> summary(polusi_pca, loadings = TRUE)

OUTPUT:
Importance of components:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
Standard deviation 1.4819456 1.2247218 1.1809526 0.8719099 0.33848287 0.185599752
Proportion of Variance 0.3660271 0.2499906 0.2324415 0.1267045 0.01909511 0.005741211
Cumulative Proportion 0.3660271 0.6160177 0.8484592 0.9751637 0.99425879 1.000000000
Loadings:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
manu -0.612 0.168 0.273 -0.137 0.102 0.703
popul -0.578 0.222 0.350 -0.695
wind -0.354 -0.131 -0.297 0.869 -0.113
precip -0.623 0.505 0.171 0.568
predays -0.238 -0.708 -0.311 -0.580
negtemp -0.330 -0.128 -0.672 -0.306 0.558 -0.136
Screeplot:
> screeplot(polusi_pca,type = "lines",col=4)
Interpretation??
• We see that the first three components all have variances (eigenvalues) greater than one and together account for almost 85% of the variance of the original variables.
• Scores on these three components might be used to graph the data with little loss of information.
• Many users of principal components analysis might be tempted to search for an interpretation of the derived components that allows them to be "labelled" in some sense. This requires examining the coefficients defining each component.
• We may label the components as follows:
1. The first component might be regarded as some index of "quality of life", with high values indicating a relatively poor environment.
2. The second component is largely concerned with a city's rainfall, having high coefficients for precip and predays, and might be labelled the "wet weather" component.
3. Component three is essentially a contrast between
precip and negtemp and will separate cities having
high temperatures and high rainfall from those that are
colder but drier. A suitable label might be simply
“climate type".
But … you should know
It must be emphasized that no mathematical method is, or could be, designed to give physically meaningful results. If a mathematical expression of this sort has an obvious physical meaning, it must be attributed to a lucky chance, or to the fact that the data have a strongly marked structure that shows up in the analysis. Even in the latter case, quite small sampling fluctuations can upset the interpretation; for example, the first two principal components may appear in reverse order, or may become confused altogether.

Marriott (1974)
So, what’s next?
• The three components can be used as the basis of various
graphical displays of the cities. In fact, this is often the most
useful aspect of a principal components analysis.
• The first few component scores provide a low-dimensional
“map" of the observations.

• So we will begin by looking at the scatterplot matrix of the first three principal components and in each panel show the relevant bivariate boxplot:
> library(MVA)   # assumed here: bvbox() is provided by the MVA package
> pairs(polusi_pca$scores[,1:3], ylim = c(-6, 4), xlim = c(-6, 4),
+ panel = function(x,y, ...) {
+ text(x, y, abbreviate(row.names(data_polusi)),
+ cex = 0.6)
+ bvbox(cbind(x,y), add = TRUE)
+ })
Scatterplot matrix of the first three principal components
• The plot demonstrates clearly that Chicago is an outlier
and suggests that Phoenix and Philadelphia may also be
suspects in this respect.

• Phoenix appears to offer the best quality of life (on the limited basis of the six variables recorded), and Buffalo is a city to avoid if you prefer a drier environment.
Now, back to the main goal

We will regress the SO2 levels on the principal components derived from the six other variables in the data.

The first question we need to ask is:

"How many principal components should be used as explanatory variables in the regression?"
• The obvious answer to this question is to use the number of
principal components that were identified as important in
the original analysis; for example, those with eigenvalues
greater than one.

• But this is a case where the obvious answer is not necessarily correct.

• It is possible that a principal component with small variance is a significant predictor of the response.
So, in this case, we will regress the SO2 variables on all six
principal components; the necessary R code is given below

> # add variable SO2 back to data
> data_polusi$SO2<-data_awal$SO2
>
> #plot
> par(mfcol=c(3,2))
> # plot SO2 against each component score (score on the x axis, to match the labels)
> out <- sapply(1:6, function(i) {
+ plot(polusi_pca$scores[,i], data_polusi$SO2,
+ xlab = paste("PC", i, sep = ""),
+ ylab = "Sulphur dioxide concentration",
+ asp = 2)
+ })
> #regress
> polusi_reg <- lm(SO2 ~ polusi_pca$scores,
+ data = data_polusi)
> summary(polusi_reg)

Call:
lm(formula = SO2 ~ polusi_pca$scores, data = data_polusi)

Residuals:
Min 1Q Median 3Q Max
-23.004 -8.542 -0.991 5.758 48.758

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 30.049 2.286 13.146 6.91e-15 ***
polusi_pca$scoresComp.1 9.942 1.542 6.446 2.28e-07 ***
polusi_pca$scoresComp.2 -2.240 1.866 -1.200 0.23845
polusi_pca$scoresComp.3 0.375 1.935 0.194 0.84752
polusi_pca$scoresComp.4 -8.549 2.622 -3.261 0.00253 **
polusi_pca$scoresComp.5 -15.176 6.753 -2.247 0.03122 *
polusi_pca$scoresComp.6 39.271 12.316 3.189 0.00306 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 14.64 on 34 degrees of freedom


Multiple R-squared: 0.6695, Adjusted R-squared: 0.6112
F-statistic: 11.48 on 6 and 34 DF, p-value: 5.419e-07
Clearly, the first principal component score is the most
predictive of sulphur dioxide concentration, but it is also
clear that components with small variance do not
necessarily have small correlations with the response.
EXERCISE
Six hematology variables were measured on 51 workers
(Royston 1983):
y1 = hemoglobin concentration y4 = lymphocyte count
y2 = packed cell volume y5 = neutrophil count
y3 = white blood cell count y6 = serum lead concentration

Data can be seen on the next page. Carry out a principal component analysis on the data. Use both S and R. Which do you think is more appropriate here? Show the percent of variance explained. Based on the average eigenvalue or a scree plot, decide how many components to retain. Can you interpret the components of either S or R? (A starting sketch in R is given after the data table.)
Table Hematology Data
Observation Number   y1   y2   y3   y4   y5   y6
1 13.4 39 4 100 14 25 17
2 14.6 46 5 000 15 30 20
3 13.5 42 4 500 19 21 18
4 15 46 4 600 23 16 18
5 14.6 44 5 100 17 31 19
6 14 44 4 900 20 24 19
7 16.4 49 4 300 21 17 18
8 14.8 44 4 400 16 26 29
9 15.2 46 4 100 27 13 27
10 15.5 48 8 400 34 42 36
11 15.2 47 5 600 26 27 22
12 16.9 50 5 100 28 17 23
13 14.8 44 4 700 24 20 23
14 16.2 45 5 600 26 25 19
15 14.7 43 4 000 23 13 17
16 14.7 42 3 400 9 22 13
17 16.5 45 5 400 18 32 17
(Continued)
Observation Number   y1   y2   y3   y4   y5   y6
18 15.4 45 6 900 28 36 24
19 15.1 45 4 600 17 29 17
20 14.2 46 4 200 14 25 28
21 15.9 46 5 200 8 34 16
22 16 47 4 700 25 14 18
23 17.4 50 8 600 37 39 17
24 14.3 43 5 500 20 31 19
25 14.8 44 4 200 15 24 29
26 14.9 43 4 300 9 32 17
27 15.5 45 5 200 16 30 20
28 14.5 43 3 900 18 18 25
29 14.4 45 6 000 17 37 23
30 14.6 44 4 700 23 21 27
31 15.3 45 7 900 43 23 23
32 14.9 45 3 400 17 15 24
33 15.8 47 6 000 23 32 21
34 14.4 44 7 700 31 39 23
35 14.7 46 3 700 11 23 23
(Continued)
Observation Number   y1   y2   y3   y4   y5   y6
36 14.8 43 5 200 25 19 22
37 15.4 45 6 000 30 25 18
38 16.2 50 8 100 32 38 18
39 15 45 4 900 17 26 24
40 15.1 47 6 000 22 33 16
41 16 46 4 600 20 22 22
42 15.3 48 5 500 20 23 23
43 14.5 41 6 200 20 36 21
44 14.2 41 4 900 26 20 20
45 15 45 7 200 40 25 25
46 14.2 46 5 800 22 31 22
47 14.9 45 8 400 61 17 17
48 16.2 48 3 100 12 15 18
49 14.5 45 4 000 20 18 20
50 16.4 49 6 900 35 22 24
51 14.7 44 7 800 38 34 16
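As a starting point for the exercise, here is a minimal hedged sketch in R; it assumes the table above has been saved as a file named hematology.csv with columns y1-y6 (the file name is an assumption, not part of the original slides).

# Sketch: PCA of the hematology data on both S and R
hema  <- read.csv("hematology.csv")
pca_S <- princomp(hema, cor = FALSE)   # PCA of the covariance matrix S
pca_R <- princomp(hema, cor = TRUE)    # PCA of the correlation matrix R
summary(pca_S, loadings = TRUE)        # percent of variance explained (S)
summary(pca_R, loadings = TRUE)        # percent of variance explained (R)
screeplot(pca_R, type = "lines")       # scree plot to help decide how many components to retain

Because y3 (white blood cell count) is on a much larger scale than the other variables, the covariance-based solution will be dominated by it, which is worth bearing in mind when deciding between S and R.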
GROUP PROJECT
Group Assignment
• One group consists of at most 5 members.
• Find a case study that matches your specialization (SK or SE), together with the data it requires.
• Carry out a principal component analysis and a factor analysis.
• Present the results of the analysis at the next meeting!
• Contents:
• Chapter 1: Introduction
• Chapter 2: Theoretical Review
• Chapter 3: Results and Discussion
• Chapter 4: Conclusion
Analisis Peubah Ganda (Multivariate Analysis)

Pertemuan IX (Meeting IX)
FACTOR ANALYSIS
Latent Variables vs Manifest Variables
• Latent variables are concepts that cannot be measured directly.
• But latent variables can be assumed to relate to a number of measurable, or manifest, variables.
• Example: how could we measure intelligence?
Spearman, C. (1904). "General intelligence," objectively determined and measured. American Journal of Psychology, 15, 201-293.
At the time, psychologists thought that intelligence could be defined by a single, all-encompassing unobservable entity, called "g" (for general intelligence). Spearman sought to describe the influence of g on examinees' test scores in several domains: Pitch, Light, Weight, Classics, French, English, and Mathematics.
• The model proposed by Spearman was very similar to a linear regression model:
(Diagram: each test score — Pitch, Light, Weight, ..., Math — is regressed on the latent variable Intelligence.)
The method of analysis most generally used to help uncover the relationships between the assumed latent variables and the manifest variables is factor analysis.

The model on which the method is based is essentially that of multiple regression, except that now the manifest variables are regressed on the unobservable latent variables (often referred to in this context as common factors), so direct estimation of the corresponding regression coefficients (factor loadings) is not possible.
Common Factor Model
• The common factor model posits that scores are a function of multiple latent variables, variables that represent more specialized abilities.
• The common factor model is very similar to a linear multiple regression model:
  y_i = μ_i + λ_i1 f_1 + λ_i2 f_2 + ... + λ_im f_m + ϵ_i,   i = 1, ..., p.
• The common factor model can be put more succinctly in matrices:
  y = μ + Λ f + ϵ,
where Λ is the p×m matrix of factor loadings, f is the m×1 vector of common factors, and ϵ is the p×1 vector of specific errors.
EFA vs CFA
• A point to be made at the outset is that factor analysis comes in
two distinct varieties:
Exploratory Factor Analysis
is used to investigate the relationship
between manifest variables and factors
without making any assumptions about
which manifest variables are related to
which factors

Confirmatory Factor Analysis


is used to test whether a specific factor
model postulated a priori provides an
adequate fit for the co-variances or
correlations between the manifest
variables
Loadings and Communalities
• Given the common factor model above,
• the coefficients λij are called loadings and serve as weights, showing how each yi individually depends on the f's.
• The assumptions are:
– F and ϵ are independent.
– E(F) = 0.
– Cov(F) = I, the key assumption in EFA: uncorrelated factors (the orthogonal factor model).
– E(ϵ) = 0.
– Cov(ϵ) = Ψ, where Ψ is a diagonal matrix.
• Under these assumptions the implied covariance matrix and the model-predicted variance of each variable take the standard forms sketched below.
• The communality is also referred to as the common variance, and the specific variance ψi has been called specificity, unique variance, or residual variance.
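The equations referred to above did not survive the export; under the orthogonal common factor model just stated, the standard expressions are:

\[
\operatorname{Cov}(\mathbf{y}) \;=\; \Sigma \;=\; \Lambda\Lambda' + \Psi,
\qquad
\operatorname{var}(y_i) \;=\; \sum_{j=1}^{m}\lambda_{ij}^{2} + \psi_i \;=\; h_i^{2} + \psi_i,
\]

where \(h_i^{2}=\sum_{j=1}^{m}\lambda_{ij}^{2}\) is the communality of the i-th variable.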
Model Estimation Methods
• Because of the long history of factor analysis, many estimation
methods have been developed.
• Before the 1950s, the bulk of estimation methods were
approximating heuristics - sacrificing accuracy for “speedy”
calculations.
• Before computers became prominent, many graduate students
spent months (if not years) on a single analysis.
• Today, however, everything is done via computers, and a handful
of methods are performed without risk of careless errors.

Three estimation methods are:


• Principal component method.
• Principal factor method.
• Maximum likelihood.
1. Principal Component Method
• This name is perhaps unfortunate in that it adds to the confusion between factor analysis and principal component analysis. In the principal component method for estimation of loadings, we do not actually calculate any principal components.
• The term principal component comes from the structure of the estimated loadings λ̂ij: the columns of Λ̂ are proportional to the eigenvectors of S, so that the loadings on the j-th factor are proportional to the coefficients of the j-th principal component.
2. Principal Factor Method
• In the principal component approach to estimating the loadings, we neglected Ψ and factored S or R. The principal factor method (also called the principal axis method) uses an initial estimate Ψ̂ and factors S − Ψ̂ or R − Ψ̂ to obtain the loadings.
• The principal factor method can easily be iterated to improve the estimates of the communalities ➔ the iterated principal factor method.
• The principal factor method and iterated principal factor method
will typically yield results very close to those from the principal
component method when either of the following is true.
1. The correlations are fairly large, with a resulting small value of m.
2. The number of variables, p, is large.
Heywood Case:
A shortcoming of the iterative approach is that it sometimes leads to a communality estimate ĥi² exceeding 1 (when factoring R).
If ĥi² > 1, then ψ̂i < 0, which is clearly improper, since we cannot have a negative specific variance.
3. Maximum Likelihood Method
• If we assume that the observations y1, y2, . . . , yn constitute a random sample from Np(μ, Σ), then Λ and Ψ can be estimated by the method of maximum likelihood. It can be shown that the estimates Λ̂ and Ψ̂ satisfy a set of equations.
• These equations must be solved iteratively, and in practice the procedure may fail to converge or may yield a Heywood case.
Estimation Method Comparison
• What you may discover when fitting the PCA method and the ML method is that the ML factors sometimes account for less variance than the factors extracted through PCA.
• This is because of the optimality criterion used for PCA,
which attempts to maximize the variance accounted for
by each factor.
• The ML, however, has an optimality criterion that
minimizes the differences between predicted and
observed covariance matrices, so the extraction will
better resemble the observed data.
Choosing The Number Of Factors
• As with PCA, the number of factors to extract can be somewhat
arbitrary.
• Several criteria have been proposed for choosing m, the number of
factors. Four of them, which are similar to those given for choosing the
number of principal components to retain, are:
1. Choose m equal to the number of factors necessary for the variance
accounted for to achieve a predetermined percentage, say 80%, of the
total variance tr(S) or tr(R).
2. Choose m equal to the number of eigenvalues greater than the average eigenvalue. For R the average is 1; for S it is Σ_{j=1}^{p} θj / p.
3. Use the scree test based on a plot of the eigenvalues of S or R. If the
graph drops sharply, followed by a straight line with much smaller
slope, choose m equal to the number of eigenvalues before the straight
line begins.
4. Test the hypothesis that m is the correct number of factors,
H0 : Σ = ΛΛ' + Ψ,
where Λ is p × m.
Factor Rotations
• Rotation is a process by which a solution is made more
interpretable without changing its underlying mathematical
properties.
• Factor rotation merely allows the fitted factor analysis model to be
described as simply as possible
• Initial factor solutions with variables loading on several factors and
with bipolar factors can be difficult to interpret. Interpretation is
more straightforward if each variable is highly loaded on at most
one factor and if all factor loadings are either large and positive or
near zero.
• The variables are thus split into disjoint sets, each of which is
associated with a single factor.
• This aim is essentially what Thurstone (1931) referred to as simple
structure.
• The search for simple structure or something close to it begins
after an initial factoring has determined the number of common
factors necessary and the communalities of each observed
variable. The factor loadings are then transformed.

• And during the rotation phase of the analysis, we might choose to abandon one of the assumptions made previously, namely that factors are orthogonal, i.e., independent.

• Consequently, two types of rotation are possible:
➢ Orthogonal rotation, in which methods restrict the rotated factors to being uncorrelated, or
➢ Oblique rotation, where methods allow correlated factors.
Orthogonal Rotation

• Orthogonal rotation is achieved by post-multiplying the original matrix of loadings by an orthogonal matrix.
• With an orthogonal rotation, the matrix of correlations between factors after rotation is the identity matrix.
• The two most commonly used techniques are known as varimax and quartimax:
– Varimax rotation,
– Quartimax rotation.
1. Varimax Rotation
• Originally proposed by Kaiser (1958)
• Varimax rotation has as its rationale the aim of factors with a few
large loadings and as many near-zero loadings as possible.
• This is achieved by iterative maximisation of a quadratic function
of the loadings. It produces factors that have high correlations
with one small set of variables and little or no correlation with
other sets.
• There is a tendency for any general factor to disappear because
the factor variance is redistributed.
2. Quartimax Rotation
• Originally suggested by Carroll (1953)
• Quartimax rotation forces a given variable to correlate highly on
one factor and either not at all or very low on other factors. It is
far less popular than varimax.
Oblique Rotation
• For oblique rotation, the original loadings matrix is post-multiplied by a matrix that is no longer constrained to be orthogonal.
• The corresponding matrix of correlations is restricted to have unit elements on its diagonal, but there are no restrictions on the off-diagonal elements.
• The two methods most often used are oblimin and promax.
• Oblimin rotation, invented by Jennrich and Sampson (1966), attempts to find simple structure with regard to the factor pattern matrix through a parameter that is used to control the degree of correlation between the factors.
• Promax rotation, a method due to Hendrickson and White (1964), operates by raising the loadings in an orthogonal solution (generally a varimax rotation) to some power. The goal is to obtain a solution that provides the best structure using the lowest possible power loadings and the lowest correlation between the factors.
Question: which type of rotation should we use?
• There is no universal answer to this question.
• There are advantages and disadvantages to using either type of rotation procedure.
• As a general rule, if a researcher is primarily concerned with getting results that "best fit" his or her data, then the factors should be rotated obliquely. If, on the other hand, the researcher is more interested in the generalisability of his or her results, then orthogonal rotation is probably to be preferred.
• One major advantage of an orthogonal rotation is simplicity, since the loadings represent correlations between factors and manifest variables.
• In many cases where the correlations between the oblique factors are relatively small, researchers may prefer to return to an orthogonal solution.
EXAMPLE WITH R
Example 1
• The data in Table 5.1 show life expectancy in years by country, age, and sex. The data come from Keyfitz and Flieger (1971) and relate to life expectancies in the 1960s.
• To begin, we will use the formal test for the number of factors incorporated into the maximum likelihood approach. We can apply this test to the data, assumed to be contained in the data frame life, with the country names labelling the rows and variable names as given in Table 5.1, using R code along the lines sketched below.
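The code itself did not survive the export; the following is a minimal sketch of the kind of call involved, assuming the data frame is indeed named life as stated above:

# likelihood-ratio test of the number of factors (p-values for 1-3 factors)
sapply(1:3, function(nf) factanal(life, factors = nf)$PVAL)
# fit the three-factor solution (varimax rotation is factanal's default)
factanal(life, factors = 3)
# estimated factor scores can be requested with the scores argument
scores <- factanal(life, factors = 3, scores = "regression")$scores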

• These results suggest that a three-factor solution might be adequate to account for the observed covariances in the data.
• The three-factor solution is as follows (note that the solution results from a varimax rotation, the default for the factanal() function):
• We see that the first factor is dominated by life expectancy at birth for both males and females; perhaps this factor could be labelled "life force at birth".
• The second reflects life expectancies at older ages, and we might label it "life force amongst the elderly".
• The third factor from the varimax rotation has its highest loadings for the life expectancies of men aged 50 and 75 and in the same vein might be labelled "life force for elderly men".
The estimated factor scores are found with the scores argument (see the sketch above), and we can use the scores to plot the data.
Checking adequacy of factor analysis
• Criteria of sample size adequacy: a sample size of 50 is very poor, 100 poor, 200 fair, 300 good, 500 very good, and more than 1,000 excellent (Comrey and Lee, 1992, p. 217).
• Kaiser-Meyer-Olkin's sampling adequacy criterion (usually abbreviated KMO), with MSA (individual measures of sampling adequacy for each item): tests whether there are a significant number of factors in the dataset.
• Technically, it tests the ratio of item correlations to partial item correlations. If the partials are similar to the raw correlations, the item doesn't share much variance with the other items.
• The range of KMO is from 0.0 to 1.0, and desired values are > 0.5. Variables with MSA below 0.5 indicate that the item does not belong to a group and may be removed from the factor analysis.
KMO in R
kmo <- function(x) {
x <- subset(x, complete.cases(x)) # Omit missing values
r <- cor(x) # Correlation matrix
r2 <- r^2 # Squared correlation coefficients
i <- solve(r) # Inverse matrix of correlation matrix
d <- diag(i) # Diagonal elements of inverse matrix
p2 <- (-i/sqrt(outer(d, d)))^2 # Squared partial correlation coefficients
diag(r2) <- diag(p2) <- 0 # Delete diagonal elements
KMO <- sum(r2)/(sum(r2)+sum(p2))
MSA <- colSums(r2)/(colSums(r2)+colSums(p2))
return(list(KMO=KMO, MSA=MSA))
}
• Bartlett’s sphericity test: Tests the hypothesis that
correlations between variables are greater than would
be expected by chance: Technically, tests if the matrix is
an identity matrix. The p-value should be significant: i.e.,
the null hypothesis that all off-diagonal correlations are
zero is falsified.
Bartlett’s sphericity test in R
Bartlett.sphericity.test <- function(x) {
method <- "Bartlett's test of sphericity"
data.name <- deparse(substitute(x))
x <- subset(x, complete.cases(x)) # Omit missing values
n <- nrow(x)
p <- ncol(x)
chisq <- (1-n+(2*p+5)/6)*log(det(cor(x)))
df <- p*(p-1)/2
p.value <- pchisq(chisq, df, lower.tail=FALSE)
names(chisq) <- "X-squared"
names(df) <- "df"
return(structure(list(statistic=chisq, parameter=df, p.value=p.value,
method=method, data.name=data.name), class="htest"))
}
Example 2
• The majority of adult and adolescent Americans regularly use
psychoactive substances during an increasing proportion of their
lifetimes. Various forms of licit and illicit psychoactive substance
use are prevalent, suggesting that patterns of psychoactive
substance taking are a major part of the individual's behavioural
repertory and have pervasive implications for the performance of
other behaviours. In an investigation of these phenomena, Huba,
Wingard, and Bentler (1981) collected data on drug usage rates for
1634 students in the seventh to ninth grades in 11 schools in the
greater metropolitan area of Los Angeles. Each participant
completed a questionnaire about the number of times a particular
substance had ever been used.
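The original slides end here before showing the analysis. As a hedged sketch of how an exploratory factor analysis could be run when only a correlation matrix is available (the object name druguse_cor and the choice of six factors below are assumptions, not part of the original material):

# druguse_cor is assumed to hold the correlation matrix of the usage items
factanal(covmat = druguse_cor, factors = 6, n.obs = 1634,
         rotation = "varimax")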
Analisis Faktor (Factor Analysis)
This video was made for the distance-learning (PJJ) sessions of the Analisis Peubah Ganda course.

Politeknik Statistika STIS

By: Budi Yuniarto, SST, M.Si

Latent Variables vs Manifest Variables
• Latent → hidden, concealed.
• A latent variable is a variable that cannot be measured directly.
• A latent variable can only be measured through a set of other (measurable) variables by means of a mathematical model.
• The other variables that manifest the latent variable are called its manifest variables → indicator variables.
(Cartoon: "Do you love me, Mickey?" — love is latent!)


Make a list of indicators:
1. He always checks out on you because
he is curious to know what you are
doing!
2. He can really do anything to solve
your problem
3. His face glows with happiness as he
sees you!
4. He is interested to know how you
spent your day and listens
attentively!
5. He respects your opinion!

These are manifest variables.


Another example: intelligence.
Spearman, C. (1904). "General intelligence," objectively determined and measured. American Journal of Psychology, 15, 201-293.
At the time, psychologists thought that intelligence could be defined by a single unobservable entity, called the "g" factor (general intelligence). Spearman tried to describe the influence of g on examinees' test scores in several domains: Pitch, Light, Weight, Classics, French, English, Mathematics.


• Model yang diusukan Spearman mirip dengan mode
regresi linier:

Pitch

Light

Weight Intelligence

...

Math

Oleh: Budi Yuniarto, SST, M.Si


Variabel laten  tidak
Variabel manifes bisa
bisa diukur langsung /
diukur langsung
unobservable

Sehingga estimasi langsung untuk mencari koefisien regresi


yang sesuai pada model tersebut tidak memungkinkan.

Metode analisis yang paling umum digunakan untuk membantu


mengungkap hubungan antara variabel laten dan variabel manifes
adalah analisis faktor

Oleh: Budi Yuniarto, SST, M.Si


Exploratory Factor Analysis
is used to investigate the relationship between manifest variables and factors without making any assumptions about which manifest variables are related to which factors → the factors are not yet known.

Confirmatory Factor Analysis
is used to test whether a specific factor model postulated a priori provides an adequate fit for the covariances or correlations between the manifest variables → the factors are assumed to be already known.
Confirmatory Factor Analysis
(Diagram, repeated over several slides: manifest variables X1-X5 linked to Factor 1 and Factor 2 — we already have an assumption about which factors are formed and which variables belong to them.)


Exploratory Factor Analysis
(Diagram, repeated over several slides: manifest variables X1-X5 and candidate Factors 1-5 — we have no prior assumption about which factors are formed, so we explore all possible factors and in the end decide which factors are important and which manifest variables belong to each.)


Common Factor
The common factor model states that the manifest variables are functions of several latent variables.
(Diagram: X1, ..., Xp each loading on Factor 1, ..., Factor m.)


The common factor model is similar to a multiple linear regression model, with the model written as follows:

X = μ + ΛF + ϵ
(p×1) (p×1) (p×m)(m×1) (p×1)


X = μ + ΛF + ϵ
(p×1) (p×1) (p×m)(m×1) (p×1)

• The coefficients λij (the elements of the matrix Λ) are called loadings and act as weights in this function.
• Assumptions of the common factor model:
– F and ϵ are independent.
– E(F) = 0.
– Cov(F) = I, the key assumption in EFA (in other words, the factors are uncorrelated).
– E(ϵ) = 0.
– Cov(ϵ) = Ψ, where Ψ is a diagonal matrix.
X = μ + ΛF + ϵ
(p×1) (p×1) (p×m)(m×1) (p×1)

• Hence:
Cov(X) = Cov(μ + ΛF + ϵ)
Cov(X) = Cov(ΛF) + Cov(ϵ)
Cov(X) = Λ Cov(F) Λ' + Ψ
Cov(X) = ΛΛ' + Ψ

The variability in X comes from two sources: the factors F and the errors ϵ.


• If we look at the elements that make up this covariance matrix, Cov(X) = ΛΛ' + Ψ (written here with a single factor for simplicity):

Var(X1)     Cov(X1,X2) … Cov(X1,Xp)       λ1² + ψ1   λ1λ2      … λ1λp
Cov(X1,X2)  Var(X2)    … Cov(X2,Xp)   =   λ1λ2       λ2² + ψ2  … λ2λp
…           …            …                …          …           …
Cov(X1,Xp)  Cov(X2,Xp) … Var(Xp)          λ1λp       λ2λp      … λp² + ψp


• The communality is also called the common variance, while the specific variance is also called specificity, unique variance, or residual variance.
• The communality is the part of the variance that comes from the factors.
• The specific variance describes the amount of variance that the factors cannot explain.


Estimation Methods

There are three model estimation methods:
• Principal component method.
• Principal factor method.
• Maximum likelihood.


Choosing the Number of Factors
• Choose m = the number of factors needed for the cumulative proportion of variance to reach a predetermined percentage, say 80%, of the total variance.
• Choose m = the number of eigenvalues greater than the average eigenvalue. When using R, the average eigenvalue is 1; when using S, it is Σ_{j=1}^{p} θj / p.
• Use a scree plot of the eigenvalues of S or R. If the graph drops sharply and is then followed by a flatter line with much smaller slope, choose m = the number of eigenvalues before the flat part begins.
• Test the hypothesis that m is the correct number of factors.


Factor Rotation
• Rotation is a process by which a solution is made easier to interpret without changing its underlying mathematical properties.
• Factor rotation allows the fitted factor analysis model to be described as simply as possible.
• An initial factor solution in which variables load heavily on several factors at once is difficult to interpret. Interpretation is easier when each variable loads heavily on only one factor and is close to zero on the others.
• The variables are thus split into disjoint sets, each associated with a single factor.
• This goal is essentially what Thurstone (1931) called simple structure.


Illustration of Factor Rotation
(Figure, repeated over several slides: the factor axes Factor 1 and Factor 2 are rotated about the point μ = (μ1, μ2) in the X1-X2 plane.)


There are two types of rotation:
• Orthogonal rotation, in which the rotated factors are kept uncorrelated. Example methods: varimax rotation, quartimax rotation.
• Oblique rotation, in which the factors are allowed to be correlated after rotation. Example methods: oblimin and promax.


Assumptions required to run Factor Analysis
• No outliers: it is assumed that there are no outliers in the data.
• Adequate sample size: the number of observations must be greater than the number of factors.
• No perfect multicollinearity: factor analysis is an interdependence technique; there should be no perfect multicollinearity between the variables.
• Homoscedasticity: because factor analysis is a linear function of the measured variables, it does not require homoscedasticity between the variables.
• Linearity: factor analysis is also based on the assumption of linearity. Non-linear variables can also be used; however, after transformation they become linear variables.
• Interval data: interval-scale data are assumed.


Interpreting the R Output
• The output shows the specific variances and the loadings from the factor extraction.
• Check, for example, one variable: (-0.745)² + (0.635)² + 0.042 = 1, i.e., communality + specific variance = 1.
• Communalities:
Oil 0.666, Density 0.845, Crispy 0.958, Fracture 0.744, Hardness 0.584.
• Together, the two extracted factors account for 76.1% of the total variability.


Thank You


Metode Multivariat Terapan (Applied Multivariate Methods)
Pertemuan XII (Meeting XII)
CLUSTER ANALYSIS
Definition

• A cluster is a collection of data objects that are:
• similar (or related) to one another within the same group,
• different from (or unrelated to) the objects in other groups.
• Cluster analysis (or clustering, data segmentation, ...) is an analysis technique that aims to find similarities among the data according to the characteristics found in the data and to group similar data objects into clusters.
• Cluster analysis → unsupervised learning: it does not require predefined classes.
• Cluster analysis differs from classification analysis in that:
• Classification deals with a known number of groups, with the aim of assigning new observations to those groups.
• Cluster analysis methods are techniques in which no assumptions are made about the number of groups or the group structure.
• Cluster analysis is a generic term for a wide range of numerical methods whose common goal is to uncover or discover groups of observations that are homogeneous and separated from other groups.
• Uses of cluster analysis:
• As a stand-alone analysis tool, to describe the structure of the data distribution.
• As a preprocessing step for other algorithms (other analyses).
Problems in Clustering

• Clustering techniques essentially try to formalize what a human observer does well in two or three dimensions.
• Clusters are identified by judging the relative distances between points.
• Clustering techniques come with a number of technical problems:
• In theory, one way to find the best solution is to try every possible grouping of all the objects — an optimization process called integer programming.
• Carrying out such a method is difficult, if not impossible. For a reasonably sized data set with n objects (either variables or individuals), the number of ways of grouping n objects into k groups is given by the formula sketched below.
• For example, there are more than four trillion possible ways in which 25 objects can be grouped into 4 groups — which solution is the best?
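The counting formula referred to above was lost in the export; the standard expression (the Stirling number of the second kind) is:

\[
N(n,k) \;=\; \frac{1}{k!}\sum_{j=0}^{k}(-1)^{j}\binom{k}{j}(k-j)^{n},
\]

which for n = 25 and k = 4 is roughly 4.7 × 10^13, consistent with the "more than four trillion" quoted above.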
Heuristic Methods

• Heuristic methods have therefore been developed to allow objects to be grouped into clusters quickly.
• Such methods are called heuristic because they do not guarantee that the solution will be optimal (the best), only that it will be better than most.
• The input to a heuristic clustering method is a measure of similarity or dissimilarity.
• The results of a heuristic method depend largely on the similarity/dissimilarity measure used by the procedure.
• Clustering procedures in cluster analysis can be hierarchical, non-hierarchical, or two-step procedures.
Similarity/Dissimilarity Measures

• Because cluster analysis seeks to identify observation vectors that are similar and to group them into clusters, many techniques use an index of similarity or proximity between each pair of observations.
• A commonly used proximity measure is the distance between two observations.
• The choice of distance metric needs to take the following into account:
• the nature of the variables (e.g., discrete, continuous, binary);
• the measurement scale (nominal, ordinal, interval, or ratio);
• the nature of the problem being studied.
Types of Distance Metrics

• Numerical Data
• Euclidean Distance, Squared Euclidean Distance, Normalized Squared Euclidean Distance, Manhattan Distance,
Chessboard Distance, Bray-Curtis Distance, Canberra Distance, Cosine Distance, Correlation Distance, Binary
Distance, Warping Distance, Canonical Warping Distance
• Boolean Data
• Hamming Distance, Jaccard Dissimilarity, Matching Dissimilarity, Dice Dissimilarity, Rogers-Tanimoto Dissimilarity,
Russell-Rao Dissimilarity, Sokal-Sneath Dissimilarity, Yule Dissimilarity
• String Data
• Edit Distance, Damerau-Levenshtein Distance, Hamming Distance, Smith-Waterman Similarity, Needleman-Wunsch
Similarity
Euclidean vs Mahalanobis Distance

• Euclidean distance
• Mahalanobis distance
(Both formulas are written out below.)
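The formulas themselves were lost in the export; the standard definitions for two observation vectors x and y are:

\[
d_{E}(\mathbf{x},\mathbf{y}) = \sqrt{(\mathbf{x}-\mathbf{y})'(\mathbf{x}-\mathbf{y})},
\qquad
d_{M}(\mathbf{x},\mathbf{y}) = \sqrt{(\mathbf{x}-\mathbf{y})'\,S^{-1}(\mathbf{x}-\mathbf{y})},
\]

where S is the sample covariance matrix, so the Mahalanobis distance takes the scales of, and correlations between, the variables into account.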
Hierarchical Methods

• Hierarchical procedures in cluster analysis are characterized by the development of a tree-like structure.
• Hierarchical clustering is carried out by taking a set of objects and grouping them one step at a time.
• There are two kinds of hierarchical clustering methods:
• agglomerative hierarchical methods;
• divisive hierarchical methods.
• Hierarchical clustering methods use a distance matrix as the clustering criterion. They do not require the number of clusters k as an input, but they do need a termination condition to determine the clusters that are formed.
Agglomerative Methods

• Agglomerative clustering methods start with the individual objects. Initially, each object is its own cluster.
• The most similar objects are then grouped together into one cluster (with two objects).
• The subsequent steps merge clusters according to the similarity or dissimilarity of the objects inside a cluster to those outside it.
• The method ends when all objects are part of a single cluster.
Divisive Methods

• These work in the opposite direction: they start with a single cluster containing all n objects.
• The large cluster is then divided into two sub-clusters such that the objects in opposite groups are relatively far from one another.
• The process continues in the same way until there are as many clusters as there are objects (n).
Linkage

• In agglomerative methods, there are several different ways of merging clusters:
• Single linkage
• Complete linkage
• Average linkage
• Centroid
• Median
• Ward's method
Dendrogram

• A dendrogram is used to visualize the cluster-formation process. A dendrogram is a graph that depicts the clusters hierarchically according to the order in which they are merged/split, together with the distance at which each cluster is formed.
Example: Agglomerative Clustering
• Problem: cluster analysis with the agglomerative algorithm.
(Figures: the data matrix for six objects A-F, the Euclidean distance formula, and the resulting distance matrix.)
• Merge two closest clusters (iteration 1)
• Update distance matrix (iteration 1)
• Merge two closest clusters (iteration 2)
• Update distance matrix (iteration 2)
• Merge two closest clusters / update distance matrix (iteration 3)
• Merge two closest clusters / update distance matrix (iteration 4)
• Final result (meeting the termination condition)
• Dendrogram tree representation (the vertical axis shows the distance, or lifetime, at which clusters merge):
1. In the beginning we have 6 clusters: A, B, C, D, E and F.
2. We merge clusters D and F into cluster (D, F) at distance 0.50.
3. We merge clusters A and B into (A, B) at distance 0.71.
4. We merge clusters E and (D, F) into ((D, F), E) at distance 1.00.
5. We merge clusters ((D, F), E) and C into (((D, F), E), C) at distance 1.41.
6. We merge clusters (((D, F), E), C) and (A, B) into ((((D, F), E), C), (A, B)) at distance 2.50.
7. The last cluster contains all the objects, which concludes the computation.
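A minimal R sketch of the same kind of run is given below. The original data matrix only survives as a figure, so the coordinates here are chosen so that single linkage reproduces the merge heights quoted above (0.50, 0.71, 1.00, 1.41, 2.50); treat them as an illustration, not the slide's actual data.

# Agglomerative clustering sketch
xy <- data.frame(x = c(1.0, 1.5, 5.0, 3.0, 4.0, 3.0),
                 y = c(1.0, 1.5, 5.0, 4.0, 4.0, 3.5),
                 row.names = c("A", "B", "C", "D", "E", "F"))
d  <- dist(xy, method = "euclidean")   # distance matrix
hc <- hclust(d, method = "single")     # single linkage; try "complete", "average", "ward.D2"
plot(hc)                               # dendrogram
cutree(hc, k = 2)                      # cut the tree into 2 clusters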
Non-hierarchical Methods: K-means Clustering
• MacQueen [25] introduced the term K-means for an algorithm that assigns each item to the cluster with the nearest centroid (mean).
• In its simple version, the K-means steps are as follows:

1) Partition all items into K initial clusters.
2) Go through the list of items and assign each item to the cluster whose centroid is nearest. Recompute the centroids of the cluster receiving the new item and of the cluster losing the item (because it moved clusters).
3) Repeat step 2 until no more items change cluster.
Seed Points
• As an alternative to partitioning all items into k initial clusters, we can choose k initial centroids (called seed points) and then proceed to step 2.
• There are several ways of choosing seed points:
• select k items at random (perhaps separated by a specified minimum distance),
• choose the first k points in the data set (again subject to a minimum distance requirement),
• select the k points that are mutually farthest apart,
• find the k points of maximum density,
• or specify k regularly spaced points in a grid-like pattern (these would not be actual data points).
• Once the seeds have been chosen, the remaining points are assigned to the initial clusters according to their distance to the nearest seed (using Euclidean distance).
The K-means Clustering Process (K = 2)
(Diagram: starting from the initial data set, arbitrarily partition the objects into k groups, update the cluster centroids, reassign the objects, and loop if needed.)
◼ Partition the objects into k non-empty subsets.
◼ Repeat:
◼ Compute the centroid (mean point) of each partition.
◼ Assign each object to the cluster with the nearest centroid.
◼ Until there is no change in cluster membership.
Example
• Suppose there are two variables X1 and X2 for 4 items A, B, C, and D. (The data table did not survive the export.)

Carry out the grouping with the agglomerative method and with k-means clustering (k = 2)! A hedged R sketch follows.
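The sketch below shows how both analyses could be run in R; since the original data table is missing, the X1 and X2 values are placeholders to be replaced with the real ones.

# Placeholder data for items A-D (substitute the values from the original table)
items <- data.frame(X1 = c(5, -1, 1, -3),
                    X2 = c(3, 1, -2, -2),
                    row.names = c("A", "B", "C", "D"))
# agglomerative clustering
hc <- hclust(dist(items), method = "average")
plot(hc)                               # dendrogram
cutree(hc, k = 2)                      # two-cluster solution
# k-means with k = 2
km <- kmeans(items, centers = 2, nstart = 10)
km$cluster                             # cluster membership
km$centers                             # final centroids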


ANALISIS PEUBAH GANDA (MULTIVARIATE ANALYSIS)
PERTEMUAN XIII (MEETING XIII)
LINEAR DISCRIMINANT ANALYSIS
DEFINITION

• To find out whether there is a difference between two or more groups of samples, we can use Hotelling's T² test or MANOVA. However, these tests cannot explain the difference, so we cannot make predictions for new data.
• To explain the difference, we can use discriminant analysis. Once we know the differences between the groups, we can make predictions for new data. In discriminant analysis the predictors are continuous variables and the response (dependent) variable is categorical → the reverse of MANOVA.
• So, discriminant analysis is an analysis that attempts to explain the difference between two or more groups of data.
(Figure: two groups of points in the X1-X2 plane.)
• Example cases:
- A bank director wants to know the difference between customers whose loans are performing and customers whose loans are in default, in terms of the customer's age, income, number of household members, and size of the loan.
- A doctor carries out a study to find out the difference between patients who recover and patients who fail to recover.
FOUNDATIONS OF DISCRIMINANT ANALYSIS

• Discriminant analysis seeks a set of functions that can be used to separate observations into known groups.
• To use this procedure, several elements must be known before the analysis:
• the number of groups must be known;
• a "training" data set with a group-membership indicator for every subject must be available.
• With discriminant analysis we can classify new observations without knowing their group membership beforehand.
• Assumptions: the same as the assumptions for MANOVA, namely:
o multivariate normality,
o independence of cases,
o homogeneity of group covariances.

• The variables must satisfy the normality assumption within each group in order to reduce the chance of misclassification.
• This means that discriminant analysis is better suited to continuous variables than to categorical variables.
• The normal distribution of a variable in each group may have a different mean, but the groups must have the same standard deviation. Why? Look at the following figure!
THE DISCRIMINANT FUNCTION
• The difference between groups in discriminant analysis can be explained by a function called the discriminant function.
• The discriminant function is the linear combination of the variables that best separates the groups.
• In the accompanying figure, w is a discriminant function that separates the o group from the + group well.
THE DISCRIMINANT FUNCTION FOR TWO GROUPS
• Suppose there are two populations to be compared, assumed to have the same covariance matrix but different mean vectors μ1 and μ2.
(Table: the n observations from population I are the p×1 vectors y11, y12, ..., y1n, and the n observations from population II are the p×1 vectors y21, y22, ..., y2n.)
The discriminant function that separates the two groups is as follows:
• Suppose two groups of data on p variables (y1, ..., yp) have the same covariance matrix but different mean vectors. The discriminant function is then the linear combination of the p variables that maximizes the distance between the two mean vectors,
z = a' y,
where the maximum distance between the two groups is attained for the a given below.
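The expression for a did not survive the export; the standard two-group (Fisher) result is:

\[
\mathbf{a} \;=\; S_{pl}^{-1}\,(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2),
\qquad
z = \mathbf{a}'\mathbf{y},
\]

where \(S_{pl}\) is the pooled within-group sample covariance matrix and \(\bar{\mathbf{y}}_1,\ \bar{\mathbf{y}}_2\) are the group mean vectors.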
CUTTING SCORE

• In a two-group discriminant function, a cutting score is used to classify the two groups uniquely.
• The cutting score is the score used to build the classification matrix.
• In practice, we usually decide which group an object belongs to by computing the cutting score as the midpoint between the two centroids:
z_cut = (z̄1 + z̄2) / 2
• Alternatively, we can use prior probabilities:
z_cut = π1 z̄1 + π2 z̄2
PRIOR PROBABILITY

• The prior probability is the probability that an observation will be classified into a class without any knowledge of X ➔ whatever the value of X, the probability that an observation falls into a given class, usually based on earlier data or previous research.
• The prior probability of class k is denoted πk, with Σ πk = 1.
• πk is usually estimated by the empirical frequency in the training data.
EXAMPLE

• Suppose that in a steel mill, steel is produced at two different rolling temperatures. Samples are taken from the two types of steel produced at the different temperatures, and the data can be seen in the table below. The measured variables are: y1 = yield point and y2 = ultimate strength.
Figure 3: Ultimate strength and yield point for steel rolled at two temperatures.

It can be seen that the scatter of the observations on the two variables can be separated visually. However, if we look at either single dimension, y1 or y2, the data overlap (cannot be separated).
From the data we compute the following.

First we check with univariate t tests whether the observations can be distinguished on each variable separately.

The univariate t tests show that the two types of steel do not differ significantly on either dimension. Yet from the earlier plot we know visually that the two groups can be separated.
• We therefore need a discriminant function onto which the data points can be projected so that the two groups do not overlap.

• In this example we only need a one-dimensional discriminant function. Working through the mathematics gives the discriminant function z = a'y, where a is obtained from the pooled-covariance expression shown earlier.

The data points are projected onto the discriminant function by computing the value of z for every observation in both groups. (The resulting projections were shown as a figure on the original slide.)
R SYNTAX (USING THE MASS PACKAGE)
> # Linear Discriminant Analysis with jackknifed prediction
> temp<-read.csv("temperatur.csv")
> library(MASS)
> fit<-lda(Temperatur~y1+y2,data=temp)
> fit
>
> # Assess the accuracy of the prediction
> # percent correct for each category of G
> ct <- table(temp$Temperatur, fit$class)
> diag(prop.table(ct, 1))
> # total percent correct
> sum(diag(prop.table(ct)))
Call:
lda(Temperatur ~ y1 + y2, data = temp)

Prior probabilities of groups:


1 2
0.4166667 0.5833333

Group means:
y1 y2
1 36.4 62.60000
2 39.0 60.42857

Coefficients of linear discriminants:


LD1
y1 0.5704591
y2 -0.6355602
To project each observation onto the discriminant function:
> fit.value<-predict(fit,temp[,1:2])
> fit.value

$x
          LD1
1  -1.9573434
2  -0.8815264
3  -3.3586662
4  -1.0117288
5  -1.1419312
6   1.0902555
7   0.3895940
8   1.5305121
9   0.8298507
10  0.6996484
11  0.5694460
12  3.2418893

(Note: the data argument here can be replaced with a test set, to validate the discriminant function, or with new data, to classify those new observations into a class.)
To see the posterior probabilities for group membership:
> fit2<-lda(Temperatur~y1+y2,data=temp,
            na.action="na.omit", CV=TRUE)
> fit2

$class
 [1] 1 1 1 1 1 2 2 2 2 2 2 2
Levels: 1 2

$posterior
              1            2
1  9.791546e-01 2.084543e-02
2  7.666647e-01 2.333353e-01
3  9.999839e-01 1.609932e-05
4  8.296252e-01 1.703748e-01
5  7.812124e-01 2.187876e-01
6  3.886802e-02 9.611320e-01
7  1.643678e-01 8.356322e-01
8  7.217866e-03 9.927821e-01
9  4.282416e-02 9.571758e-01
10 7.677870e-02 9.232213e-01
11 3.124067e-01 6.875933e-01
12 6.274357e-08 9.999999e-01

$terms
Temperatur ~ y1 + y2
PRACTICUM
