Principal Component Analysis in R

Luke Hayden
August 9th, 2018

In this tutorial, you'll learn how to use PCA (Principal Component Analysis) in R to reduce a dataset with
many variables and create visualizations to display that reduced data.

Principal Component Analysis (PCA) is a useful technique for exploratory data analysis, allowing you to better visualize the
variation present in a dataset with many variables. It is particularly helpful in the case of "wide" datasets, where you have
many variables for each sample. In this tutorial, you'll discover PCA in R.

More specifically, you'll tackle the following topics:


You'll first go through an introduction to PCA: you'll learn about principal components and how they relate to eigenvalues
and eigenvectors.

Then, you'll try a simple PCA with a simple and easy-to-understand data set.

Next, you'll use the results of the previous section to plot your first PCA - Visualization is very important!

You'll also see how you can get started on interpreting the results of these visualizations and

How to set the graphical parameters of your plots with the ggbiplot package!

Of course, you want your visualizations to be as customized as possible, and that's why you'll also cover some ways of
doing additional customizations to your plots!

You'll also see how you can add a new sample to your plot and you'll end up projecting a new sample onto the original PCA.

Wrap-up

Introduction to PCA
As you already read in the introduction, PCA is particularly handy when you're working with "wide" data sets. But why is that?

Well, in such cases, where many variables are present, you cannot easily plot the data in its raw format, making it difficult to
get a sense of the trends present within. PCA allows you to see the overall "shape" of the data, identifying which samples are
similar to one another and which are very different. This can enable us to identify groups of samples that are similar and
work out which variables make one group different from another.

The mathematics underlying it are somewhat complex, so I won't go into too much detail, but the basics of PCA are as
follows: you take a dataset with many variables, and you simplify that dataset by turning your original variables into a smaller
number of "Principal Components".

But what are these exactly? Principal components are the underlying structure in the data. They are the directions where
there is the most variance, the directions where the data is most spread out. This means that we try to find the straight line
that best spreads the data out when the data is projected onto it. This is the first principal component: the straight line that shows
the most substantial variance in the data.

PCA is a type of linear transformation of a given data set that has values for a certain number of variables (coordinates) for a
certain number of samples. This linear transformation fits the dataset to a new coordinate system in such a way that the most
significant variance is found on the first coordinate, and each subsequent coordinate is orthogonal to the last and has a lesser
variance. In this way, you transform a set of x correlated variables over y samples to a set of p uncorrelated principal
components over the same samples.

Where many variables correlate with one another, they will all contribute strongly to the same principal component. Each
principal component sums up a certain percentage of the total variation in the dataset. Where your initial variables are
strongly correlated with one another, you will be able to approximate most of the complexity in your dataset with just a few
principal components. As you add more principal components, you summarize more and more of the original dataset. Adding
additional components makes your estimate of the total dataset more accurate, but also more unwieldy.

Eigenvalues and Eigenvectors

Just like many things in life, eigenvectors and eigenvalues come in pairs: every eigenvector has a corresponding eigenvalue.
Simply put, an eigenvector is a direction, such as "vertical" or "45 degrees", while an eigenvalue is a number telling you how
much variance there is in the data in that direction. The eigenvector with the highest eigenvalue is, therefore, the first
principal component.

So wait, there are possibly more eigenvalues and eigenvectors to be found in one data set?
That's correct! The number of eigenvalues and eigenvectors that exist is equal to the number of dimensions the data set has.
Take a data set with 2 variables: the data set is two-dimensional, which means that there are two eigenvectors and two
eigenvalues. Similarly, you'd find three pairs in a three-dimensional data set.
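
To make this concrete, here is a minimal sketch using a small simulated two-variable data set (an arbitrary example, not part of the tutorial's data): the eigenvalues of the covariance matrix are exactly the variances of the principal components that prcomp() reports.

set.seed(1)
x <- rnorm(100)
y <- x + rnorm(100, sd = 0.5)  # y correlates strongly with x
m <- scale(cbind(x, y))        # center and scale both variables

eigen(cov(m))$values           # eigenvalues: the variance along each principal direction
prcomp(m)$sdev^2               # the same values, as reported by prcomp()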

We can reframe a dataset in terms of these eigenvectors and eigenvalues without changing the underlying information. Note
that reframing a dataset in terms of a set of eigenvalues and eigenvectors does not entail changing the data itself; you're just
looking at it from a different angle, which should represent the data better.

Now that you've seen some of the theory behind PCA, you're ready to see all of it in action!

A Simple PCA
In this section, you will try a PCA using a simple and easy to understand dataset. You will use the mtcars dataset, which is
built into R. This dataset consists of data on 32 car models, taken from the American magazine Motor Trend (1974 issue).
For each car, you have 11 features, expressed in varying (US) units. They are as follows:

* mpg : Fuel consumption (Miles per (US) gallon): more powerful and heavier cars tend to consume more fuel.

* cyl : Number of cylinders: more powerful cars often have more cylinders

* disp : Displacement (cu.in.): the combined volume of the engine's cylinders

* hp : Gross horsepower: this is a measure of the power generated by the car

* drat : Rear axle ratio: this describes how a turn of the drive shaft corresponds to a turn of the wheels. Higher values will
decrease fuel efficiency.
* wt : Weight (1000 lbs): pretty self-explanatory!

* qsec : 1/4 mile time: a measure of the car's speed and acceleration

* vs : Engine block: this denotes whether the vehicle's engine is shaped like a "V", or is a more common straight shape.

* am : Transmission: this denotes whether the car's transmission is automatic (0) or manual (1).

* gear : Number of forward gears: sports cars tend to have more gears.

* carb : Number of carburetors: associated with more powerful engines

Note that the units used vary and occupy different scales.
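
Since mtcars is built into R, you can take a quick look at the raw data before computing anything:

head(mtcars)  # the first six cars, all 11 features
str(mtcars)   # structure: 32 observations of 11 numeric variables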

Compute the Principal Components

Because PCA works best with numerical data, you'll exclude the two categorical variables ( vs and am ). You are left with a
matrix of 9 columns and 32 rows, which you pass to the prcomp() function, assigning your output to mtcars.pca . You will
also set two arguments, center and scale , to be TRUE . Then you can have a peek at your PCA object with summary() .

mtcars.pca <- prcomp(mtcars[, c(1:7,10,11)], center = TRUE, scale. = TRUE)

summary(mtcars.pca)

## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 2.3782 1.4429 0.71008 0.51481 0.42797 0.35184
## Proportion of Variance 0.6284 0.2313 0.05602 0.02945 0.02035 0.01375
## Cumulative Proportion 0.6284 0.8598 0.91581 0.94525 0.96560 0.97936
## PC7 PC8 PC9
## Standard deviation 0.32413 0.2419 0.14896
## Proportion of Variance 0.01167 0.0065 0.00247
## Cumulative Proportion 0.99103 0.9975 1.00000

You obtain 9 principal components, which you can call PC1 to PC9. Each of these explains a percentage of the total variation in the
dataset. That is to say: PC1 explains 63% of the total variance, which means that nearly two-thirds of the information in the
dataset (9 variables) can be encapsulated by just that one principal component. PC2 explains 23% of the variance. So, by
knowing the position of a sample in relation to just PC1 and PC2, you can get a very accurate view of where it stands in
relation to other samples, as PC1 and PC2 together explain 86% of the variance.
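
If you want to verify those percentages, they follow directly from the component standard deviations stored in the PCA object; a quick sketch:

# Each component's variance (sdev squared) divided by the total variance
pca.var <- mtcars.pca$sdev^2
round(pca.var / sum(pca.var), 4)  # matches the "Proportion of Variance" row above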

Let's call str() to have a look at your PCA object.

str(mtcars.pca)

## List of 5
## $ sdev : num [1:9] 2.378 1.443 0.71 0.515 0.428 ...
## $ rotation: num [1:9, 1:9] -0.393 0.403 0.397 0.367 -0.312 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:9] "mpg" "cyl" "disp" "hp" ...
## .. ..$ : chr [1:9] "PC1" "PC2" "PC3" "PC4" ...
## $ center : Named num [1:9] 20.09 6.19 230.72 146.69 3.6 ...
## ..- attr(*, "names")= chr [1:9] "mpg" "cyl" "disp" "hp" ...
## $ scale : Named num [1:9] 6.027 1.786 123.939 68.563 0.535 ...
## ..- attr(*, "names")= chr [1:9] "mpg" "cyl" "disp" "hp" ...
## $ x : num [1:32, 1:9] -0.664 -0.637 -2.3 -0.215 1.587 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
## .. ..$ : chr [1:9] "PC1" "PC2" "PC3" "PC4" ...
## - attr(*, "class")= chr "prcomp"

I won't describe the results here in detail, but your PCA object contains the following information:

The center point ( $center ), scaling ( $scale ), and standard deviation ( $sdev ) of each principal component

The relationship (correlation or anticorrelation, etc) between the initial variables and the principal components (
$rotation )

The values of each sample in terms of the principal components ( $x )
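
For example, you can pull out the weightings of each original variable on PC1, or the coordinates of the first few cars in principal-component space:

mtcars.pca$rotation[, "PC1"]  # how each variable weights onto PC1
head(mtcars.pca$x[, 1:2])     # each car's position along PC1 and PC2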

Plotting PCA
Now it's time to plot your PCA. You will make a biplot, which includes both the position of each sample in terms of PC1 and
PC2 and also will show you how the initial variables map onto this. You will use the ggbiplot package, which offers a user-
friendly and pretty function to plot biplots. A biplot is a type of plot that will allow you to visualize how the samples relate to
one another in our PCA (which samples are similar and which are different) and will simultaneously reveal how each variable
contributes to each principal component.

Before you can get started, don't forget to first install ggbiplot !

install.packages("devtools")  # devtools provides install_github()
library(devtools)
install_github("vqv/ggbiplot")

Next, you can call ggbiplot on your PCA:


library(ggbiplot)

ggbiplot(mtcars.pca)
The axes are seen as arrows originating from the center point. Here, you see that the variables hp , cyl , and disp all
contribute to PC1, with higher values in those variables moving the samples to the right on this plot. This lets you see how
the data points relate to the axes, but it's not very informative without knowing which point corresponds to which sample
(car).

You'll provide an argument to ggbiplot : let's give it the rownames of mtcars as labels . This will name each point with
the name of the car in question:

ggbiplot(mtcars.pca, labels=rownames(mtcars))
Now you can see which cars are similar to one another. For example, the Maserati Bora, Ferrari Dino and Ford Pantera L all
cluster together at the top. This makes sense, as all of these are sports cars.

How else can you try to better understand your data?

Interpreting the results


Maybe it will help to look at the origin of each of the cars. You'll put them into one of three categories (cartegories?), one each for
the US, Japanese and European cars. You make a list for this info, then pass it to the groups argument of ggbiplot . You'll also
set the ellipse argument to TRUE , which will draw an ellipse around each group.

mtcars.country <- c(rep("Japan", 3), rep("US",4), rep("Europe", 7),rep("US",3), "Europe", rep("Japan", 3), rep("US",4), rep("Eur

ggbiplot(mtcars.pca,ellipse=TRUE, labels=rownames(mtcars), groups=mtcars.country)


Now you see something interesting: the American cars form a distinct cluster to the right. Looking at the axes, you see that
the American cars are characterized by high values for cyl , disp , and wt . Japanese cars, on the other hand, are
characterized by high mpg . European cars are somewhat in the middle and less tightly clustered than either group.

Of course, you have many principal components available, each of which maps differently to the original variables. You can
also ask ggbiplot to plot these other components by using the choices argument.

Let's have a look at PC3 and PC4:

ggbiplot(mtcars.pca,ellipse=TRUE,choices=c(3,4), labels=rownames(mtcars), groups=mtcars.country)


You don't see much here, but this isn't too surprising. PC3 and PC4 explain very small percentages of the total variation, so it
would be surprising if you found that they were very informative and separated the groups or revealed apparent patterns.

Let's take a moment to recap: having performed a PCA using the mtcars dataset, we can see a clear separation between
American and Japanese cars along a principal component that is closely correlated to cyl , disp , wt , and mpg . This
provides us with some clues for future analyses; if we were to try to build a classification model to identify the origin of a car,
these variables might be useful.

Graphical parameters with ggbiplot


There are also some other variables you can play with to alter your biplots. You can add a circle to the center of the dataset (
circle argument):

ggbiplot(mtcars.pca,ellipse=TRUE,circle=TRUE, labels=rownames(mtcars), groups=mtcars.country)


You can also scale the samples ( obs.scale ) and the variables ( var.scale ):

ggbiplot(mtcars.pca, ellipse=TRUE, obs.scale = 1, var.scale = 1, labels=rownames(mtcars), groups=mtcars.country)

You can also remove the arrows altogether, using var.axes :

ggbiplot(mtcars.pca, ellipse=TRUE, obs.scale = 1, var.scale = 1, var.axes=FALSE, labels=rownames(mtcars), groups=mtcars.country)

Customize ggbiplot
As ggbiplot is based on the ggplot function, you can use the same set of graphical parameters to alter your biplots as you
would for any ggplot . Here, you're going to:

Specify the colours to use for the groups with scale_colour_manual()

Add a title with ggtitle()

Apply the minimal theme with theme_minimal()

Move the legend with theme()

ggbiplot(mtcars.pca, ellipse=TRUE, obs.scale = 1, var.scale = 1, labels=rownames(mtcars), groups=mtcars.country) +
  scale_colour_manual(name="Origin", values= c("forest green", "red3", "dark blue")) +
  ggtitle("PCA of mtcars dataset") +
  theme_minimal() +
  theme(legend.position = "bottom")

Adding a new sample
Okay, so let's say you want to add a new sample to your dataset. This is a very special car, with stats unlike any other. It's
super-powerful, has a 60-cylinder engine, amazing fuel economy, no gears and is very light. It's a "spacecar", from Jupiter.

Can you add it to your existing dataset and see where it places in relation to the other cars?

You will add it to mtcars , creating mtcarsplus , then repeat your analysis. You might expect to be able to see which
region's cars it is most like.

spacecar <- c(1000,60,50,500,0,0.5,2.5,0,1,0,0)

mtcarsplus <- rbind(mtcars, spacecar)


mtcars.countryplus <- c(mtcars.country, "Jupiter")

mtcarsplus.pca <- prcomp(mtcarsplus[,c(1:7,10,11)], center = TRUE,scale. = TRUE)

ggbiplot(mtcarsplus.pca, obs.scale = 1, var.scale = 1, ellipse = TRUE, circle = FALSE, var.axes=TRUE, labels=c(rownames(mtcars), "spacecar"), groups=mtcars.countryplus) +
  scale_colour_manual(name="Origin", values= c("forest green", "red3", "violet", "dark blue")) +
  ggtitle("PCA of mtcars dataset, with extra sample added") +
  theme_minimal() +
  theme(legend.position = "bottom")

But that would be a naive assumption! The shape of the PCA has changed drastically, with the addition of this sample. When
you consider this result in a bit more detail, it actually makes perfect sense. In the original dataset, you had strong
correlations between certain variables (for example, cyl and mpg ), which contributed to PC1, separating your groups from
one another along this axis. However, when you perform the PCA with the extra sample, the same correlations are not
present, which warps the whole dataset. In this case, the effect is particularly strong because your extra sample is an extreme
outlier in multiple respects.

If you want to see how the new sample compares to the groups produced by the initial PCA, you need to project it onto that
PCA.

Project a new sample onto the original PCA


What this means is that the principal components are defined without relation to your spacecar sample, then you compute
where spacecar is placed in relation to the other samples by applying the transformations that your PCA has produced. You
can think of this as, instead of getting the mean of all the samples and allowing spacecar to skew this mean, you get the
mean of the rest of the samples and look at spacecar in relation to this.

What this means is that you scale the values for spacecar in relation to the PCA's center ( mtcars.pca$center ).
You then apply the rotation of the PCA matrix to the spacecar sample. Finally, you can rbind() the projected values for
spacecar to the rest of the pca$x matrix and pass this to ggbiplot as before:

s.sc <- scale(t(spacecar[c(1:7,10,11)]), center= mtcars.pca$center)


s.pred <- s.sc %*% mtcars.pca$rotation

mtcars.plusproj.pca <- mtcars.pca


mtcars.plusproj.pca$x <- rbind(mtcars.plusproj.pca$x, s.pred)
ggbiplot(mtcars.plusproj.pca, obs.scale = 1, var.scale = 1, ellipse = TRUE, circle = FALSE, var.axes=TRUE, labels=c(rownames(mtcars), "spacecar"), groups=mtcars.countryplus) +
  scale_colour_manual(name="Origin", values= c("forest green", "red3", "violet", "dark blue")) +
  ggtitle("PCA of mtcars dataset, with extra sample projected") +
  theme_minimal() +
  theme(legend.position = "bottom")

This result is drastically different. Note that all the other samples are back in their initial positions, while spacecar is placed
somewhat near the middle. Your extra sample is no longer skewing the overall distribution, but it can't be assigned to a
particular group.
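
As an aside, base R's predict() method for prcomp objects performs this kind of projection in one step. Note that predict() applies both the stored centering and the stored scaling before rotating, so for a PCA computed with scale. = TRUE its output will differ from the centering-only projection above; a sketch:

# predict.prcomp() centers AND scales the new sample using the values stored
# in the prcomp object, then applies the rotation
s.new <- t(spacecar[c(1:7, 10, 11)])              # a 1 x 9 matrix
colnames(s.new) <- rownames(mtcars.pca$rotation)  # match the original variable names
predict(mtcars.pca, newdata = s.new)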

But which is better, the projection or the recomputation of the PCA?

It depends somewhat on the question that you are trying to answer: the recomputation shows that spacecar is an outlier,
while the projection tells you that you can't place it in one of the existing groups. Performing both approaches is often useful when
doing exploratory data analysis by PCA. This type of exploratory analysis is often a good starting point before you dive more
deeply into a dataset. Your PCAs tell you which variables separate American cars from the others and that spacecar is an outlier
in our dataset. A possible next step would be to see if these relationships hold true for other cars or to see how cars cluster
by marque or by type (sports cars, 4WDs, etc.).

Wrap-up
So, there you have it!

You have learned the principles of PCA, how to create a biplot, how to fine-tune that plot and have seen two different
methods for adding samples to a PCA analysis. Thanks for reading!

If you would like to learn more about R, take DataCamp's free Introduction to R course.

Use DataCamp Workspace to experiment with the code in this tutorial!

COMMENTS

David Passmore
14/08/2018 08:07 PM

Superb explanation. Posted for my students to see.


Wishmore Stanley
20/08/2018 08:54 PM

very insightful and practical was having difficulty finding an application for pca. thanks.

Leandro Marx
13/09/2018 02:00 AM

Is there a way to do this with ggplot2 tool? ggbiplot isn't available for 3.5.1 :(

Albert de Roos
23/09/2018 08:12 PM

Use the install_github from devtools that attempts to install a package directly from GitHub. Did the trick for me:

install.packages("devtools")

library(devtools)

install_github("vqv/ggbiplot")
library(ggbiplot)

Simone Rabeling
16/10/2018 02:35 AM

Thanks!

Sohil Gala
15/11/2018 04:31 AM

i get this "ERROR: lazy loading failed for package 'ggbiplot'" any ideas on what to do here

Jyothsna Harithsa
23/04/2019 03:53 AM

install.packages("devtools")

install.packages("fs")

library(devtools)

install_github("vqv/ggbiplot")

library(ggbiplot)

RichaKr Kumari
13/09/2018 02:13 PM

Best ever Tutorial for PCA. Thank you for sharing your knowledge.

Kevin Pan
18/09/2018 11:18 PM

Thank you. It helps me a lot.

Uli Wellner
30/09/2018 06:06 PM

perfect shortcut intro. thanks !

Dr Pankaj Kumar Medhi
04/10/2018 09:17 PM

Great explanation! new insights about the projection of new sample.

Miguel Barbosa
17/10/2018 03:41 AM

Thanks for the great tutorial!

Sourav Mandal
18/10/2018 02:15 PM
Thanks very much for touching upon a much awaited topic! Are you planning to do a tutorial on things like partial least square
regression (PLSR) etc?

Harshali Chaudhari
23/10/2018 05:17 PM

Very useful information... very well explained.. Thank you for sharing it..

Lucas Ishikawa
10/11/2018 01:02 AM

Amazing! So rich and still simple in every single detail


Thanks for sharing

Dexter Pante
10/11/2018 06:02 PM

Just a question, in the graph where you plotted PC1 and PC2, you said that "hp, cyl and disp all contribute to PC1". Does this mean that
wt and carb do not contribute to PC1? And what was the explanation for this. I'd like also to know of all the variables you mentioned
that contribute to PC1 which one has the most contribution?

Thank you for taking time to read my comment. I hope you could enlighten me on this.

Luke Hayden
13/11/2018 01:07 AM

hp, cyl and disp contribute most strongly to PC1, while wt and carb also contribute, but less strongly. Strictly speaking, every
variable will contribute to each PC. The degree of correlation between the initial variable and the PC is its weighting. You can
extract the weightings from your prcomp object via prcomp$rotation

Mark Vermeersch
18/01/2019 08:06 PM

Actually cyl, disp and wt contribute more to PC1. cyl 0.402; disp 0.397; hp 0.367; wt 0.3734. Maybe this could be updated in the
tutorial? Great work!

Philip Doyle
14/11/2018 07:24 PM

Great tutorial! Exceptionally clear writing (which I find is rare for R tutorials)

Philip Doyle
15/11/2018 04:42 PM

For some reason I can't utilise ggbiplot after installing. After entering - library(ggbiplot) - I get the response:

Loading required package: ggplot2

Error: package or namespace load failed for ‘ggplot2’ in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]):

namespace ‘rlang’ 0.2.0 is already loaded, but >= 0.2.1 is required

Error: package ‘ggplot2’ could not be loaded

In addition: Warning message:

package ‘ggplot2’ was built under R version 3.4.4

alvaro barroqueiro
22/03/2019 04:02 PM
Philipe, did you try to install and run a older version of R?

Daniel Gonzalez
25/07/2019 12:46 AM

Apparently what you need is to update rlang


Try quitting the session and then go back and run install.packages("rlang") (You may try this without quitting R but it will probably ask
you to finish the session anyway).
Doing that will update your package (don't try update.packages() as it will update all your loaded packages, which may create new
compatibility problems).

JEREMIAH KABISSA
16/11/2018 01:17 AM

Dear Luke

How are you. My name is Jeremiah Joe Kabissa a student of KU Leuven, I real want to learn different tactics of using R, for various
analytical tools. However for now I real like to learn PCA in R, FA in R, PLS in R and the like in relation to the former three.

Sincerely

Jeremiah

Simone Schütz
20/11/2018 08:22 PM

Thanks. How to adapt the PCA if my data includes missing values?



Tram Ly
22/11/2018 12:52 AM

Dear Luke ,

Thanks you so much for your lesson

I don't understand about interpreter result with " the Japanese cars, on the other hand, are characterized by high mpg" because I
think low mpg is right because the arrow of mpg is on negative side.

I am so appreciate if you explain

Thanks you so much

Abdellatif EL MSAYRYB
11/12/2018 02:46 PM

Hello, i'm trying to install the ggbiplot package but doesn't work for me, is there a difference between the ggplot package and ggbiplot
in the python 3 ?

Luke Hayden
15/12/2018 04:52 PM

Both ggplot and ggbiplot are packages for R, not for Python. ggbiplot depends on ggplot, however.

Elijah Juma
24/12/2018 10:53 AM
Hi Luke,

I'd like to understand the intuition behind how the grouping has been done. Would you be able to offer additional explanation?

Izzy Bizzy
07/01/2019 12:48 AM

Hello,

I have heard that the output obtained through PCA analysis can be used as predictors in regression. For example, if you have two data
sets: A contains the results of an experiment with each row representing a participant response, B contains (highly variable) social
information about each participant in the study. I would like to perform a PCA analysis on data set B and use the resulting vectors as
predictor variables in a mixed effects regression performed on data set A. In other words, I am trying to test how can social factors
account for participant behavior. How would I go about including those principal components in my regression? I am familiar with
mixed effects regression in R but if anybody can provide some sample code to achieve this, that would be great.

Thanks!

David C
13/01/2019 03:39 AM

Like many others, ggbiplot is not installing. I get the error.

Error in install_github("vqv/ggbiplot") :

could not find function "install_github"

To all readers, this tutorial no longer has use as ggbiplot appears to no longer be available. You'll need other packages for PCA, not
ggbiplot

Waite Cheung
13/01/2019 01:04 PM

Hi, David. First, you should install devtools, then you can use install_github function.

install.packages("devtools")

library(devtools)

install_github("vqv/ggbiplot")

Good luck!

any problem, contact me: [email protected]

Jyothsna Harithsa
23/04/2019 03:52 AM

install.packages("devtools")

install.packages("fs")

library(devtools)

install_github("vqv/ggbiplot")

library(ggbiplot)

Isabela Pichardo
28/11/2019 12:36 AM

Great! Thanks!

Sobia Ahmed
21/02/2019 12:52 PM

A great tutorial on PCA. Thanks for this

alvaro barroqueiro
22/03/2019 01:23 AM

> library(ggbiplot)

Error in library(ggbiplot) : there is no package called ‘ggbiplot’

does not exist a package with a name 'ggbiplot' ...

alvaro barroqueiro
22/03/2019 01:24 AM

oh, I saw the answer up there :)

alvaro barroqueiro
22/03/2019 01:35 AM

however ...

> library(ggbiplot)
Error in library(ggbiplot) : there is no package called ‘ggbiplot’

alvaro barroqueiro
23/03/2019 07:39 PM

Warning in install.packages :

package ‘ggbiplot’ is not available (for R version 3.5.3)

can I install older versions of R?

Jyothsna Harithsa
23/04/2019 03:51 AM

install.packages("devtools")

install.packages("fs")

library(devtools)

install_github("vqv/ggbiplot")

library(ggbiplot)

Anh Pham
27/03/2019 09:23 AM

how to download dataset "mtcars"? Thank you

hellozah
02/04/2019 05:56 AM

mtcars is installed with R, that is, it's built in. If you already have R installed you can view the mtcars data set by typing mtcars at the
prompt.

Mohan Arthanari
05/04/2019 11:02 PM

Warning in install.packages :

package ‘ggbiplot’ is not available (for R version 3.5.0)

i cant install this package

ellen plantsoil
08/04/2019 02:22 AM

Hi Mohan,

I had the same problem, but there is a way to fix it.

library(devtools)

Warning message:

package ‘devtools’ was built under R version 3.4.4

> install_github("vqv/ggbiplot")

Downloading GitHub repo vqv/ggbiplot@master


These packages have more recent versions available.

Which would you like to update?

1: ggplot2 (3.1.0 -> 3.1.1) [CRAN] 2: gtable (0.2.0 -> 0.3.0) [CRAN]

3: lazyeval (0.2.1 -> 0.2.2) [CRAN] 4: munsell (0.4.3 -> 0.5.0) [CRAN]

5: R6 (2.2.2 -> 2.4.0) [CRAN] 6: rlang (0.3.1 -> 0.3.4) [CRAN]

7: scales (0.5.0 -> 1.0.0) [CRAN] 8: stringi (1.1.7 -> 1.4.3) [CRAN]

9: stringr (1.3.1 -> 1.4.0) [CRAN] 10: tibble (1.4.2 -> 2.1.1) [CRAN]

11: CRAN packages only 12: All

13: None

Enter one or more numbers separated by spaces, or an empty line to cancel

Note that the "devtools" was built in R 3.4.4 version, we may have R 3.5 or somethings, so that's why it is not available.

If you skip all the update, as " an empty line to cancel", then you will get the ggbiplot!

I tried and it totally worked.

Good luck.

Zamzam Al-Rawahi
11/04/2019 10:26 AM

How to group samples if I have 930 samples instead of 32?

This would be very complicated for 930 samples.


mtcars.country <- c(rep("Japan", 3), rep("US",4), rep("Europe", 7),rep("US",3), "Europe", rep("Japan", 3), rep("US",4),

Luke Hayden
26/04/2019 05:11 PM

What does your data look like? How do you identify groups? In this case, I identify country via part of the model name. You could
make a simple case_when approach if you have simple rules to assign your samples to groups.

Rafik Margaryan
23/04/2019 11:54 PM

Very nice, what I was searching for.

miriama vuiyasawa
25/04/2019 02:13 AM

How did you label the categories to get ellipses? I can't follow how you labelled them.

Abdullah Al Mahmud
25/04/2019 10:25 PM

Thanks for the great insight. However, I did not understand one aspect. How did you select and order the country names? I mean, they
weren't in the original data.


Yifan Feng
29/04/2019 10:40 AM

Very useful! Can you explain how to come up with the below codes? Thank you.

mtcars.country <- c(rep("Japan", 3), rep("US",4), rep("Europe", 7),rep("US",3), "Europe", rep("Japan", 3), rep("US",4), rep("Europe", 3),
"US", rep("Europe", 3))

marco arena
08/05/2019 10:59 AM

excellent tutorial

Pete De Jager
13/05/2019 11:12 PM

Exceptional article, covering PCA in depth and clarifying all of the mysteries. All the examples work flawlessly and the explanations are
lucid. Many thanks!

Keiana Dunn
15/05/2019 12:14 PM

I installed install.packages("fs") and them received an error: Warning in install.packages :


package ‘library(devtools)’ is not available (for R version 3.4.4). Also Warning in install.packages :
package ‘library(ggbiplot)’ is not available (for R version 3.4.4). Any suggestions on how to get past this?

Graeme Dean
17/05/2019 10:38 PM

Hi everyone, I am following the tutorial and so far am delighted with the explanation, but now I have hit a wall, the package is not
available. I have tired several times to download it but always returns the same message: package ‘ggbiplot’ is not available (for R
version 3.5.2)

Luke Hayden
22/05/2019 02:26 AM

Did you install it via the install_github() function from devtools?

Tasos Baltadakis
04/06/2019 12:23 AM

Hello there, its a very helpful tutorial considering my level.

I wanted to ask if there is a command through ggbiplot to change the arrows from red to black.

Cheers

Luke Hayden
11/06/2019 12:45 AM

I don't know of any easy way to do so from within ggbiplot. You might be able to pass a colour for the arrows via ggplot' theming
functions.

Salahuddin Khan
20/06/2019 04:28 AM

Wow! Loads of plain information. Just need to understand the following codes:

mtcars.country <- c(rep("Japan", 3), rep("US",4), rep("Europe", 7),rep("US",3), "Europe", rep("Japan", 3), rep("US",4),

ggbiplot(mtcars.pca,ellipse=TRUE, labels=rownames(mtcars), groups=mtcars.country)

Where did the numbers 3, 4, 7 etc.. after each country name come from?

Jiapeng He
25/06/2019 11:31 PM

The first number "3" represent the first 3 cars are from Japan, and the "4" represent the following 4 cars are from US. All the
numbers add up to 32, which means 32 type of cars in the mtcars data set

huiping zhou
20/07/2019 03:16 AM

I cannot install ggbiplot

> library(devtools)
> install_github("vqv/ggbiplot")

Downloading GitHub repo vqv/ggbiplot@master

Error in utils::download.file(url, path, method = method, quiet = quiet, :

cannot open URL 'https://api.github.com/repos/vqv/ggbiplot/tarball/master'

in this URL , I didn't find the ggbiplot zip file...

please help!

Debomitra Dey
06/08/2019 01:19 AM

Great tutorial. Thanks! I am unable to download the package ggbiplot.

Jack Dundas
15/08/2019 04:21 AM

Thanks for the awesome write up!


Could I get you to confirm my understanding?
In the last chart, with the spacecar, my interpretation is that it is placed right near the 0,0 position, which indicates that PC1 and PC2
don't do a great job of "explaining" the space car.
Correspondingly, the Valiant is summarized nicely by PC2, but not PC1.

Further, the contsituents of PC1 are primarily ctl, disp and wt. Does this mean that cyl, disp and wt are all highly colinear?
Can the same be said about gear and carb due to their high PC2 values?

Luke Hayden
21/08/2019 04:43 PM
"In the last chart, with the spacecar, my interpretation is that it is placed right near the 0,0 position, which indicates that PC1 and
PC2 don't do a great job of "explaining" the space car.
Correspondingly, the Valiant is summarized nicely by PC2, but not PC1. "

Not quite. You can't think of a PC as "explaining" a sample, but as summarising the differences between samples. The Valiant is
different from the Dino according to PC2, but not according to PC1. So, PC1 doesn't capture the difference between the Dino and
the Valiant, but PC2 does.

"Further, the contsituents of PC1 are primarily ctl, disp and wt. Does this mean that cyl, disp and wt are all highly colinear?"

Generally speaking, yes. Variables that contribute strongly to a PC will tend to be correlated. Pairwise plotting of variables is the
best way to tease these relationships out.

"Can the same be said about gear and carb due to their high PC2 values?"

Probably, but see above regarding pairwise plotting.

anwesha Saha
19/08/2019 07:25 PM

Hi Luke,

It an excellent tutorial. Thank you so much for this. I am constantly having trouble with installing ggbiplot. Do you have any suggestions.
I tried

> options(timeout=20000)

> install_github("vqv/ggbiplot")

Error: Failed to install 'unknown package' from GitHub:

Timeout was reached: Connection timed out after 10000 milliseconds


it still did not resolve the issue. Do you have any other options to prepare similar graphs other than ggbiplot package.

Hector Alvaro Rojas
24/08/2019 09:18 PM

Hi Luke:

Congratulations man!

You have made a great article on this topic, especially the biplot explanation.

This is a great tutorial on PCA.

Superb explanation. Very insightful and practical at the same time.

I think this is one of the best articles I have read about it.

Now, please let me know by the time you create a similar article but on the “Multidimensional scaling” topic.

My best regards to you!

Kaitlin DeAeth
26/08/2019 09:32 PM

Question. How were you able to label with data by car names if they weren't in the original data? How do you incorporate the
characters of your variables if all the data in a PCA needs to be numerical?

Xin Sun
30/09/2019 07:55 AM

Informative article. Thanks!

Claudia Silva
10/10/2019 07:55 AM

library(ggbiplot) is not available for R 3.6

I cannot upload here the output when I tried to install from GitHub. But it says that it can't do install the package ggbiplot.

is there any other library to do the same?

Yang Wang
11/10/2019 03:34 AM

Hello, thank you for the explanation! I have a question about the ggbiplot. So I looked around and found there seems no way to modify
the width of the ellipse, in base r plots, we just use lwd = 1 or whatever values, but in ggbiplot, is there also a way to change this? Also, is
it possible to add a polygon of the ellipse? Thank you, the two questions really confuse me a some time!

Lina Gao
29/10/2019 01:01 AM

Dear Luke
How about the contribution of mpg and drat to PC1? Can we say that the original variables with negative loading contribute less than
the ones with positive loading? Another question is whether we can use predict function to get the scores of the new sample and draw
it in the plot

Debbie Jenkins
04/11/2019 11:21 PM

Hello - like so many others I found this tutorial very helpful. Unfortunately, I can not, after trying everything that I've found on-line,
download ggbiplot.

package ‘digest’ successfully unpacked and MD5 sums checked

Error: Failed to install 'ggbiplot' from GitHub:

(converted from warning) cannot remove prior installation of package ‘digest’

In addition: Warning messages:

1: In untar2(tarfile, files, list, exdir) :

skipping pax global extended headers

2: In untar2(tarfile, files, list, exdir) :

skipping pax global extended headers

I hope you can help me. The package is exactly what I need :)

ps I tried install_github("vqv/ggbiplot", force = TRUE) but got the same message !

Aaron Soderstrom
13/11/2019 10:48 AM

Great introduction to plotting a PCA. Thank you for taking the time and putting this together. Its funny over the years of using R for
data science, I know a lot about cars now.. Cheers, Aaron

Frederico Faleiro
29/11/2019 08:28 PM

Dear @Luke Hayden, thanks for the very helpful tutorial. However, I think there is an error in your prediction part. As you use the
options center and scale in the PCA, you must use both in the prediction, but you only center the prediction. I think for this reason the
spacecar is not an outlier in your graph. Check a discussion about it here: https://stat.ethz.ch/pipermail/r-help/2008-April/160033.html

I tried to put example code here but the system returned the following error: Purify checking crashed :(

Waseem Ashfaq
03/12/2019 12:26 PM

Superb and very informative . . My question is how to make PCA representation bold (Whole PCA graphs).

Angela Marcela Suarez Mayorga
03/12/2019 07:17 PM

Thanks a lot for your detailed and easy-to-follow explanation. Very useful! (Also the comments, thank you all).

Stephana Müller
21/01/2020 02:23 AM

Great addition to the Unsupervised Machine Learning Course I took on DataCamp. Thanks!

Benjamin Malunda
09/02/2020 01:25 AM

greetings

when i try to run

install_github("vqv/ggbiplot")

it keeps giving me the following error:ERROR: failed to lock directory 'C:/Users/bkaso/Documents/R/win-library/3.6' for modifying

Try removing 'C:/Users/bkaso/Documents/R/win-library/3.6/00LOCK-stringi'

Error: Failed to install 'ggbiplot' from GitHub:

(converted from warning) installation of package ‘stringi’ had non-zero exit status

how can i solve this

Paul Cotter
11/02/2020 09:46 PM

My version of R cannot find the required packages...

library(devtools) or library(ggbiplot) where else might these be found?


ggbiplot(mtcars.pca)

Kamden Glade
10/03/2020 09:03 PM

Is there a way to use categorical variables to group the data from the original data set, rather than adding the grouping as you did for
country?
