Lecture 3&4

Msc EDCBA

Marc-Arthur Diaye
Full Professor
University Paris 1 Pantheon-Sorbonne

Data Analytics
Datavisualization
• Concerning Correlation matrix

• Example 2 :
• Suppose that we want to change the color.
• Several methods.

• Method 1:
• #(20) means that we want a vector with length 20
• #We take here three colors : red, grey ("#999999"), and blue
• #The colors will be a blend of these three colors

• col_pse<- colorRampPalette(c("red", "#999999", "blue"))(20)


• corrplot(m_newdata2, col=col_pse)
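As a quick sanity check (a base-R sketch; only grDevices, which is loaded by default, is needed), we can inspect the vector that colorRampPalette returns:

```r
# colorRampPalette() returns a *function*; calling it with 20
# produces a character vector of 20 interpolated hex colors.
col_pse <- colorRampPalette(c("red", "#999999", "blue"))(20)

length(col_pse)   # 20
col_pse[1]        # "#FF0000" : pure red at one end
col_pse[20]       # "#0000FF" : pure blue at the other
```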
Datavisualization
• Concerning Correlation matrix

• Example 2 :
• Suppose that we want to change the color.
• Several methods.

• Method 2:
• #Use the package RColorBrewer
• #Select a specific palette of colors (for example "RdBu") and the
number of colors from this palette (for example n=8)
• #If RColorBrewer is not already installed, you need first to do so
• install.packages("RColorBrewer")
• library(RColorBrewer)
• corrplot(m_newdata2, col=brewer.pal(n=8, name="RdBu"))
RColorBrewer
Datavisualization
• Remark:

• We can ask R to display all the palettes:

• display.brewer.all()

• We can also ask R to display only a part of
the figure. For instance :
• display.brewer.pal(n = 8, name = "RdBu")
Datavisualization
• Remark:
• The result of this latter instruction is the
following:
Datavisualization
• Concerning Correlation matrix

• Example 2 :

• Method 2:
• #Let us now go back to the instruction:
• corrplot(m_newdata2, col=brewer.pal(n=8,
name="RdBu"))
Datavisualization
• Concerning Correlation matrix

• Example 2 :
• Suppose that we want to change the color.
• Several methods.

• Method 3:
• #Use the package wesanderson
• #Select a specific palette of colors (for example "Darjeeling1") and
the number of colors from this palette (for example n=5)
• #If wesanderson is not already installed, you need first to do so
• install.packages("wesanderson")
• library(wesanderson)
• corrplot(m_newdata2, col=wes_palette(n=5,
name="Darjeeling1"))
Datavisualization
• Concerning Correlation matrix

• Example 2 :
• Suppose that we want to change the color.
• Several methods.

• Method 3:
• #Use the package wesanderson

• Palette of colors available :


• BottleRocket1, BottleRocket2, Rushmore1, Royal1, Royal2, Zissou1,
Darjeeling1, Darjeeling2, Chevalier1 , FantasticFox1 , Moonrise1,
Moonrise2, Moonrise3, Cavalcanti1, GrandBudapest1,
GrandBudapest2, IsleofDogs1, IsleofDogs2
Datavisualization
• Concerning Correlation matrix

• Example 2b :
• Suppose that we want to change the
background color (bg)
• and the color of the variables’ names (tl.col)

• corrplot(m_newdata2, col=c("black",
"white"), bg="lightblue", tl.col="black")
Datavisualization
• Concerning Correlation matrix

• Example 3 :
• corrplot(m_newdata2, method="pie")
Datavisualization
• Concerning Correlation matrix

• Example 3b :
• corrplot(m_newdata2, method="ellipse")
Datavisualization
• Concerning Correlation matrix

• Example 4 :
• corrplot(m_newdata2, method="color")
Datavisualization
• Concerning Correlation matrix

• Example 5 :
• corrplot(m_newdata2, method="number")
Datavisualization
• Concerning Correlation matrix

• Example 6 :
• corrplot(m_newdata2, method="color",
type="lower")

• #Display only the lower part of the


correlation matrix
Datavisualization
• Concerning Correlation matrix

• Example 7 :
• corrplot(m_newdata2, method="color",
type="upper")

• #Display only the upper part of the


correlation matrix
Datavisualization
• Concerning Correlation matrix

• Example 8 :
• corrplot(m_newdata2, order="AOE",
method="color", addCoef.col = "#999999")

• #AOE means Angular Order of the Eigenvectors
• #It is an ordering method for the variables of the correlation matrix
Datavisualization
• Concerning Correlation matrix

• Example 8b :
• corrplot(m_newdata2, order="alphabet")

• #Alphabet order
Datavisualization
• Concerning Correlation matrix

• Example 9 :
• corrplot.mixed(m_newdata2, order="AOE")
Datavisualization
• Concerning Correlation matrix

• Example 10 :
• res<-cor.mtest(m_newdata2, conf.level=.99)
• corrplot(m_newdata2, p.mat=res$p, sig.level=.01)

• #We ask here R to compute the significance test,
which produces p-values and confidence intervals for
each pair of input features.
• #p.mat is the matrix of p-values. We want this matrix
to be res$p, the square matrix with p-values as cells
resulting from cor.mtest.
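As a sketch of what cor.mtest computes (a hypothetical base-R helper, not corrplot's actual implementation), one can run cor.test on every pair of columns and collect the p-values into a square matrix:

```r
# Hypothetical helper sketching what cor.mtest does: pairwise
# cor.test() p-values collected into a square, symmetric matrix.
pval_matrix <- function(df, conf.level = 0.95) {
  p <- ncol(df)
  m <- matrix(0, p, p, dimnames = list(names(df), names(df)))
  for (i in seq_len(p - 1)) {
    for (j in (i + 1):p) {
      test <- cor.test(df[[i]], df[[j]], conf.level = conf.level)
      m[i, j] <- m[j, i] <- test$p.value
    }
  }
  m
}

set.seed(42)
toy <- data.frame(a = rnorm(30), b = rnorm(30), c = rnorm(30))
m <- pval_matrix(toy, conf.level = .99)  # 3 x 3, symmetric, zero diagonal
```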
Datavisualization
• Concerning Correlation matrix

• Example 10 :
• res<-cor.mtest(m_newdata2, conf.level=.99)
• corrplot(m_newdata2, p.mat=res$p, sig.level=.01)

• # sig.level = significance level

• # Here we want this significance level to be equal to .01
(1%)
• # If the p-value in p.mat is bigger than sig.level, then
the corresponding correlation coefficient is regarded
as insignificant.
Datavisualization
• Concerning Correlation matrix

• Example 10b :
• res<-cor.mtest(m_newdata2, conf.level=.95)
• corrplot(m_newdata2, p.mat=res$p, sig.level=.01)
Datavisualization
• Concerning Correlation matrix

• Example 10c :
• corrplot(m_newdata2, p.mat=res$p,
insig="blank")
Datavisualization
• Concerning Correlation matrix

• Example 10d :
• corrplot(m_newdata2, p.mat=res$p, insig="p-value")

• #Displays the p-value for the insignificant correlations
Datavisualization
• Concerning Correlation matrix

• Example 10e :

• corrplot(m_newdata2, p.mat=res$p, insig="p-value", sig.level=-1)

• #sig.level is artificially set to -1

• #This obliges all correlations to be non-significant,
in order to display them all in the correlation matrix
Datavisualization
• Concerning Correlation matrix

• Example 11 :

• Put the variables’ names on the diagonal of
the correlation matrix.
Introduction to GGPLOT 2
CLASS 3
Datavisualization
• Creating a ggplot : geometric objects

• Count in two dimensions

• ggplot2 provides two geometries, geom_bin2d and geom_hex,
which make it possible to count the number of observations in two
dimensions.

• ggplot(mydata) +
• aes(x = age, y = salnet) +
• geom_bin2d() +
• xlab("Age") +
• ylab("Net wage") +
• labs(fill = "Frequency")
Datavisualization
• Creating a ggplot : geometric objects

• Interpretation :
• Workers with a net wage of 100000 euros are between
28 and 65 years old. Moreover, it seems that there are
not many workers earning 100000 euros a year.
• It seems that the most numerous workers are around
25 years old earning around 17000 euros a year, around
33 years old earning around 21500 euros a year,
or between 40 and 43 years old earning around
21500 euros a year.
Datavisualization
• Creating a ggplot : geometric objects

• Count in two dimensions

• ggplot2 provides two geometries, geom_bin2d and geom_hex, which
make it possible to count the number of observations in two dimensions.

• install.packages("hexbin")
• library(hexbin)

• ggplot(mydata) +
• aes(x = age, y = salnet) +
• geom_hex() +
• xlab("Age") +
• ylab("Net wage") +
• labs(fill = "Frequency")
Datavisualization
• Creating a ggplot : geometric objects

• Count in two dimensions

• If we want to restrict the number of bins when using
geom_hex (of course, this restriction can also be made
for geom_bin2d):

• Example: #Restriction to a maximum of 10 bins both


horizontally and vertically
• ggplot(mydata, mapping=aes(age, salnet)) +
• geom_hex(bins = 10) + xlab("age") + ylab("Net wage")
Datavisualization
• Creating a ggplot : geometric objects

• Practice 11

• Create the code necessary to have the


following graph:
Change the background, change the colors and restrict the number of bins
Datavisualization
• Creating a ggplot : geometric objects

• Solution to Practice 11
• #bins=10, change the color, change the
background
• ggplot(mydata, mapping=aes(age, salnet)) +
• scale_fill_gradient(low = "#00FF00", high =
"#FFFF00") + geom_hex(bins = 10) +
theme_classic() + xlab("age") + ylab("Net
wage")
Datavisualization
• Creating a ggplot : geometric objects

• Count in two dimensions

• With 2 qualitative variables

• ggplot(mydata) +
• aes(x = cscor, y = ag5) +
• geom_bin2d() +
• xlab("cscor") +
• ylab("Age") +
• labs(fill = "Frequency")
Datavisualization
• Creating a ggplot : geometric objects

• Interpretation :
• Blue-collars are mainly between 40 and 49 years old.
• Workers between 40 and 49 years old are mainly Blue-collars.
• Blue-collars between 40 and 49 years old are the most numerous.

• We can check with table(ag5, cscor)

               cscor
  ag5             3     4     5     6
  15 to 29      217   543   699   925
  30 to 39      632  1081   762  1584
  40 to 49      663   967   638  1683
  50 to 59      454   674   350  1047
  At least 60    25    11    12    17
Datavisualization
• Creating a ggplot : geometric objects

• Count in two dimensions

• With 2 qualitative variables

• ggplot(mydata) +
• aes(x = cscor, y = ag5) +
• geom_hex() +
• xlab("cscor") +
• ylab("Age") +
• labs(fill = "Frequency")
Datavisualization
• Creating a ggplot : geometric objects

• Interpretation :
• The most numerous workers are those between
30 and 39. They are followed by those between
40 and 49; followed by those between 15 and
29; and finally those 50 and over.

  ag5
  15-29   30-39   40-49   50-59   60 and more
   2384    4059    3951    2525            65
Datavisualization
• Creating a ggplot : geometric objects

• Count in two dimensions

• With 1 qualitative variable and 1 continuous variable

• ggplot(mydata) +
• aes(x = cscor, y = age) +
• geom_hex() +
• xlab("cscor") +
• ylab("Age") +
• labs(fill = "Frequency")
Datavisualization
• Creating a ggplot : annotation

• In addition to labelling major components of our plot,


it is often useful to label individual observations or
groups of observations.

• The first tool we have at our disposal is geom_text().

• geom_text() is similar to geom_point(), but it has an


additional aesthetic: label.
• This makes it possible to add textual labels to our
plots.
Datavisualization
• Creating a ggplot : annotation

• Suppose that we want to add the workers’
professional status (manager/non manager)
to the graph of the distribution (salnet, age).

• ggplot(mydata) + geom_text(aes(x=ag5,
y=salnet, label = v_manager))
Datavisualization
• Creating a ggplot : annotation

• WARNING : You first need to add the
variable "v_manager" to mydata

• mydata$v_manager<-rep("Non
manager",length(cscor))
• mydata$v_manager[cscor %in%
c(3,4)]="Manager"

• table(mydata$v_manager)
Datavisualization
• Creating a ggplot : annotation

• We can do the same with different colors w.r.t.
the sex variable.

• ggplot(mydata) +
• geom_text(aes(x=ag5, y=salnet, label =
v_manager, color=sex))
Datavisualization
• Creating a ggplot : annotation

• Of course, we can use both a geom_point and


a geom_text .
• Here the points are colored w.r.t. the sex
variable and the graph gives the professional
status (manager/non manager) of the
individuals with maximum wage (in each age
group).
Datavisualization
• Creating a ggplot : annotation

• best_in_age =
mydata[c("sex","sexe","salnet","cscor","age","ag5","v_manager")]

• #We use here what is called a “PIPE”


• best_in_age2<-best_in_age %>%
• group_by(ag5,v_manager) %>%
• summarize(salnet = max(salnet, na.rm = TRUE))

Datavisualization
• Max(salnet) by ag5 and v_manager:

      ag5             v_manager     salnet
   1  15 à 29 ans     Manager       100000
   2  15 à 29 ans     Non manager    41271
   3  30 à 39 ans     Manager       100000
   4  30 à 39 ans     Non manager   100000
   5  40 à 49 ans     Manager       100000
   6  40 à 49 ans     Non manager    62171
   7  50 à 59 ans     Manager       100000
   8  50 à 59 ans     Non manager   100000
   9  60 ans et plus  Manager       100000
  10  60 ans et plus  Non manager    31709
Datavisualization
• Creating a ggplot : annotation

• ggplot(mydata, aes(ag5, salnet)) +


• geom_point(aes(colour = sex)) +
• geom_text(aes(label = v_manager), data =
best_in_age2)
Datavisualization
• Creating a ggplot : annotation

• Interpretation:
• Concerning workers between 15 and 29 years
old, the non manager who earns the highest
net wage is a man. His wage is around 41K€.
The manager who earns the highest net wage
can be a woman or a man. His/her wage is
100K€.
Datavisualization
• Creating a ggplot : annotation

• The graph is hard to read because the labels


overlap with each other, and with the points.

• We can make things a little better by


switching to geom_label() which draws a
rectangle behind the text.
Datavisualization
• Creating a ggplot : annotation

• ggplot(mydata, aes(ag5, salnet)) +


• geom_point(aes(colour = sex)) +
• geom_label(aes(label = v_manager), data =
best_in_age2)
Datavisualization
• Creating a ggplot : annotation

• In this graph, for 30-39 year-old individuals and
for 50-59 year-old individuals, the longest label
("non manager") has covered the "manager"
label.
• Moreover, it is difficult to see the points
associated with some labels. For instance at the
highest wage (100K€), it is very difficult to
comment on the gender of the individuals.
Datavisualization
• Creating a ggplot : annotation

• We will therefore use the alpha parameter in
order to make the labels translucent:

• ggplot(mydata, aes(ag5, salnet)) +


• geom_point(aes(colour = sex)) +
• geom_label(aes(label = v_manager), data =
best_in_age2, alpha=0.5)
Datavisualization
• Creating a ggplot : annotation

• We can use the nudge_y and/or nudge_x
parameters to move the labels slightly away from
the corresponding points:

• ggplot(mydata, aes(ag5, salnet)) +


• geom_point(aes(colour = sex)) +
• geom_label(aes(label = v_manager), data =
best_in_age2, nudge_x=-0.1, alpha=0.5)
Datavisualization
• Creating a ggplot : annotation

• We can use the nudge_y and/or nudge_x
parameters to move the labels slightly away from
the corresponding points:

• ggplot(mydata, aes(ag5, salnet)) +


• geom_point(aes(colour = sex)) +
• geom_label(aes(label = v_manager), data =
best_in_age2, nudge_y=2, alpha=0.5)
Datavisualization
• Creating a ggplot : annotation

• Remark concerning nudge_y and nudge_x :


• Actually the full syntax is position=position_nudge()
• The arguments x and y are the amounts of horizontal and
vertical distance to move.
• Default: position=position_nudge(x=0, y=0)

• ggplot(mydata, aes(ag5, salnet)) +


• geom_point(aes(colour = sex)) +
• geom_label(aes(label = v_manager), data = best_in_age2,
position=position_nudge(x=-0.1, y=2), alpha=0.5)
Datavisualization
• Creating a ggplot : annotation

• This is not satisfying because for 30-39 years old


individuals and for 50-59 years old individuals, there
are two labels practically on top of each other.

• There is no way that we can fix these by applying the


same transformation for every label.

• Instead, we can use the ggrepel package.


• This useful package will automatically adjust labels so
that they don’t overlap.
Datavisualization
• Creating a ggplot : annotation

• install.packages("ggrepel")
• library(ggrepel)

• ggplot(mydata, aes(ag5, salnet)) +
• geom_point(aes(colour = sex)) +
• geom_point(size = 3, shape = 1, data =
best_in_age2) +
• ggrepel::geom_label_repel(aes(label =
v_manager), data = best_in_age2)
Zoom in order to better see the output
Datavisualization
• Creating a ggplot : annotation

• We can use also geom_text_repel

• ggplot(mydata, aes(ag5, salnet)) +


• geom_point(aes(colour = sex)) +
• geom_point(size = 3, shape = 1, data =
best_in_age2) +
• ggrepel::geom_text_repel(aes(label =
v_manager), data = best_in_age2)
SHAPE
Datavisualization
• Creating a ggplot : annotation

• ggplot(mydata, aes(ag5, salnet)) +


• geom_point(aes(colour = sex)) +
• geom_point(size = 6, shape = 2, data =
best_in_age2) +
• ggrepel::geom_label_repel(aes(label =
v_manager), data = best_in_age2)
Datavisualization
• Creating a ggplot : geometric objects

• Remark concerning the PIPE


• In one of the above codes, we have used the so-
called PIPE
• Let us recall the code in which we used a
PIPE:
• best_in_age2<-best_in_age %>%
• group_by(ag5,v_manager) %>%
• summarize(salnet = max(salnet, na.rm = TRUE))
Datavisualization
• Creating a ggplot : geometric objects

• Remark concerning the PIPE


• A Pipe is a nice tool for expressing a sequence of
multiple operations.
• It lets us define a sequence of instructions:
• First you do this
• Then you do that
• And so on (…)
• In other words, Pipes take the output of one
function and send it directly to the next.
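As an illustration with base R's native pipe |> (available since R 4.1; the magrittr pipe %>% behaves the same way for these simple calls):

```r
# The pipe sends the output of one call into the next:
# v |> sqrt() |> sum() reads "take v, THEN take square roots, THEN sum".
v <- c(1, 4, 9)

piped  <- v |> sqrt() |> sum()   # square roots first, then the sum
nested <- sum(sqrt(v))           # the equivalent nested call

piped == nested   # TRUE: both are 1 + 2 + 3 = 6
```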
Datavisualization
• Creating a ggplot : geometric objects

• Remark concerning the PIPE


• A Pipe is denoted %>%

• In our code
• best_in_age2<-best_in_age %>%
• group_by(ag5,v_manager) %>%
• summarize(salnet = max(salnet, na.rm = TRUE))

• we ask the software to :

• create a dataset called best_in_age2 from the dataset
best_in_age by selecting the maximum wage for each class of age
(five categories) and professional status (two categories).
• This is done through three operations.
Datavisualization
• Creating a ggplot : geometric objects

• Remark concerning the PIPE


• Remark : The package “tidyverse” lets us use
a pipe directly without loading another package.

• This is why we can use the PIPE method directly here.

• Of course the PIPE method can be used without
loading the package “tidyverse”, which is a
collection of packages for data science (including
ggplot2 for datavisualization).
• If we do not need “tidyverse”, then we can load the
package magrittr in order to use the PIPE method.
Datavisualization
• Creating a ggplot : geometric objects

• Remark concerning the double colon “::”

• ggrepel::geom_label_repel
• means : use the “geom_label_repel” function
from the “ggrepel” package.
Datavisualization
• Creating a ggplot : geometric objects

• Remark on geom_label
• Of course, like geom_text, it is possible to use
geom_label alone.

• Example (BE PATIENT):


• ggplot(mydata) + geom_label(aes(x=ag5,
y=salnet, label = v_manager))
Datavisualization
• Creating a ggplot : geometric objects

• Practice 12

• Salnet, nafen_g4, nafen_g16


• Delete the observations with missing values
• COI2006 does not include the agriculture business
sector. Delete the observations corresponding to
this business sector.
• Create the code necessary to have the following
graph:
Datavisualization
• Creating a ggplot : geometric objects

• Solution to Practice 12

• newdata<-subset(mydata, nafen_g4%in%c("EU","ET","EV"))
• table(newdata$nafen_g4)

• best_in_bs = newdata[c("nafen_g4","nafen_g16","ag5", "age", "salnet")]
• best_in_bs2<-best_in_bs %>%
• group_by(ag5,nafen_g4) %>%
• summarize(salnet = max(salnet, na.rm = TRUE))

• ggplot(newdata, aes(ag5, salnet)) +
• geom_point(aes(colour = nafen_g16)) +
• geom_point(size = 2, shape = 1, data = best_in_bs2) +
• ggrepel::geom_label_repel(aes(label = nafen_g4), data = best_in_bs2)
Datavisualization
• Creating a ggplot : statistical transformation

• Here we consider a basic bar chart, as drawn


with geom_bar()

• Example:
• ggplot(data = mydata) + geom_bar(mapping
= aes(x = cscor))
Datavisualization
• Creating a ggplot : Position adjustment

• Display a bar chart of proportion, rather than


count
• Example:
• ggplot(data = mydata) + geom_bar(mapping =
aes(x = cscor, y=..prop.., group = 1))
Datavisualization
• Creating a ggplot : Position adjustment

• Display a bar chart of proportion, rather than count

• The “group” option makes it possible to compute the
proportion of each modality of cscor.

• What happens if we remove the “group” option?

• Example:
• ggplot(data = mydata) + geom_bar(mapping = aes(x =
cscor, y=..prop..))
Datavisualization
• Creating a ggplot : Position adjustment

• We can colour a bar chart using either the


colour aesthetic, or, more usefully, fill:

• Example:
• #Use of the colour aesthetic
• ggplot(data = mydata) + geom_bar(mapping =
aes(x = cscor, colour=cscor))
Datavisualization
• Creating a ggplot : Position adjustment

• We can colour a bar chart using either the


colour aesthetic, or, more usefully, fill:

• Example:
• #Use of the fill aesthetic
• ggplot(data = mydata) + geom_bar(mapping =
aes(x = cscor, fill=cscor))
Datavisualization
• Creating a ggplot : Position adjustment

• Choose our colours.

• Example:
• ggplot(data = mydata) + geom_bar(mapping =
aes(x = cscor, fill=cscor)) +
scale_fill_manual(values=c("blue",
"#FF3399", "#FFFF33", "#00FF00"))
Datavisualization
• Creating a ggplot : Position adjustment

• Practice 13

• Create the code necessary to have the


following graph:
The colors here are not the default ones. Use blue, pink, yellow and green
Datavisualization
• Creating a ggplot : Position adjustment

• Solution to Practice 13

• ggplot(data = mydata) + geom_bar(mapping =


aes(x = cscor, colour=cscor)) +
scale_colour_manual(values=c("blue",
"#FF3399", "#FFFF33", "#00FF00"))
Datavisualization
• Creating a ggplot : Position adjustment

• What happens if you map the fill aesthetic to


another variable, like sex: the bars are
automatically stacked. Each colored rectangle
represents a combination of cscor and sex :

• Example:
• ggplot(data = mydata) + geom_bar(mapping =
aes(x = cscor, fill=sex))
Datavisualization
• Creating a ggplot : Position adjustment

• It is also possible to use the colour aesthetic:

• Example:
• #Use of the colour aesthetic
• ggplot(data = mydata) + geom_bar(mapping =
aes(x = cscor, colour=sex))
Datavisualization
• Creating a ggplot : Position adjustment

• Use the colour aesthetic without filling the


bars.

• Example:
• #Use of the colour aesthetic
• ggplot(data = mydata) + geom_bar(mapping =
aes(x = cscor, colour=sex), fill=NA)
Datavisualization
• Creating a ggplot : Position adjustment

• If we don’t want a stacked bar chart, we can


use one of two other options: "dodge" or
"fill".
• position = "fill" works like stacking, but
makes each set of stacked bars the same
height. This makes it easier to compare
proportions across groups.
Datavisualization
• Creating a ggplot : Position adjustment

• Example:
• ggplot(data = mydata) + geom_bar(mapping =
aes(x = cscor, fill=sex), position="fill")
Datavisualization
• Creating a ggplot : Position adjustment

• position = "dodge" places overlapping


objects directly beside one another. This
makes it easier to compare individual values.
Datavisualization
• Creating a ggplot : Position adjustment

• Example:
• ggplot(data = mydata) + geom_bar(mapping =
aes(x = cscor, fill=sex), position="dodge")
Datavisualization
• Creating a ggplot : statistical transformation

• With two variables


• Example:
• # Default: dark bars
• ggplot(data = mydata) + geom_bar(mapping =
aes(x = cscor, y = salnet), stat = "identity")

• ggplot(data=mydata, aes(x=cscor, y=salnet)) +


geom_bar(stat="identity")
Datavisualization
• Creating a ggplot : statistical transformation

• We can colour a bar chart using either the


colour aesthetic, or, more usefully, fill:

• Example:
• # bars filled with other colors
• ggplot(data=mydata, aes(x=cscor, y=salnet)) +
geom_bar(stat="identity", fill="#FF9999")
Datavisualization
• Creating a ggplot : statistical transformation

• We can colour a bar chart using either the


colour aesthetic, or, more usefully, fill:

• Example:
• # bars filled with other colors
• ggplot(data=mydata, aes(x=cscor, y=salnet)) +
geom_bar(stat="identity", fill="blue")
Datavisualization
• Creating a ggplot : statistical transformation

• Instead of changing colors globally, we can


map variables to colors.

• In other words, make the color conditional on


a variable, by putting it inside an aes()
statement.
Datavisualization
• Creating a ggplot : statistical transformation

• # Bars: x and fill both depend on cscor


• ggplot(mydata, aes(x=cscor, y=salnet,
fill=cscor)) + geom_bar(stat="identity")
Datavisualization
• Creating a ggplot : Coordinate systems

• The default coordinate system is the


Cartesian coordinate system where the x and
y positions act independently to determine
the location of each point.

• We can however use other coordinate


systems that could be interesting.
Datavisualization
• Creating a ggplot : Coordinate systems

• coord_flip() switches the x and y axes.

• Example :
• ggplot(data=mydata, aes(x=cscor, y=salnet)) +
geom_point()+coord_flip()
The two axes are switched
Datavisualization
• Creating a ggplot : Coordinate systems

• coord_flip() switches the x and y axes.


• coord_flip() is useful for long labels: it’s hard to
get them to fit without overlapping on the x-axis.

• Example : without coord_flip()


• ggplot(data=mydata, aes(x=ag5, y=salnet)) +
geom_point()
Datavisualization
• Creating a ggplot : Coordinate systems

• coord_flip() switches the x and y axes.


• coord_flip() is useful for long labels: it’s hard to
get them to fit without overlapping on the x-axis.

• Example : with coord_flip()


• ggplot(data=mydata, aes(x=ag5, y=salnet)) +
geom_point()+coord_flip()
Datavisualization
• Creating a ggplot : Coordinate systems

• coord_flip() switches the x and y axes.

• coord_flip() is useful if we want horizontal


boxplots.

• Example : without coord_flip()


• ggplot(data=mydata, aes(x=ag5, y=salnet)) +
geom_boxplot()
Datavisualization
• Creating a ggplot : Coordinate systems

• coord_flip() switches the x and y axes.

• coord_flip() is useful if we want horizontal


boxplots.

• Example : with coord_flip()


• ggplot(data=mydata, aes(x=ag5, y=salnet)) +
geom_boxplot()+coord_flip()
Datavisualization
• Creating a ggplot : Coordinate systems

• coord_flip() switches the x and y axes.


• coord_polar() uses polar coordinates.

• Example : with coord_flip()


• ggplot(data=mydata, aes(x=cscor, fill=cscor))
+ geom_bar()+coord_flip()
Datavisualization
• Creating a ggplot : Coordinate systems

• coord_flip() switches the x and y axes.


• coord_polar() uses polar coordinates.

• Example : with coord_polar()


• ggplot(data=mydata, aes(x=cscor, fill=cscor))
+ geom_bar()+coord_polar()
Datavisualization
• Creating a ggplot : Coordinate systems

• coord_flip() switches the x and y axes.


• coord_polar() uses polar coordinates.

• Example : with coord_polar() and defining the


width
• ggplot(data=mydata, aes(x=cscor, fill=cscor)) +
geom_bar(width = 1)+coord_polar()+labs(x=NULL)
Datavisualization
• Creating a ggplot : Coordinate systems

• Remark : Coxcomb chart


• The previous chart is called Coxcomb chart.
• In such a chart, each category is represented
by a section of the disc, and each section has
the same angle.
• The area of a section represents the value of
the corresponding category.
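In formulas (a short sketch, assuming K categories with equal angles θ = 2π/K): the area of a circular sector of radius r and angle θ is θr²/2, so a category with count n_k drawn with area proportional to n_k gets a radius proportional to √n_k:

```latex
A_k \;=\; \frac{\theta}{2}\, r_k^2 \;\propto\; n_k
\qquad\Longrightarrow\qquad
r_k \;\propto\; \sqrt{n_k}
```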
Datavisualization
• Creating a ggplot : Coordinate systems

• Remark : Coxcomb chart


• In order to better understand, let us take a
variable with two modalities : sex

• ggplot(data=mydata, aes(x=sex, fill=sex)) +


geom_bar(width =
1)+coord_polar()+labs(x=NULL)
Datavisualization
• Creating a ggplot : Coordinate systems

• Remark : Coxcomb chart

• Let us recall the distribution of sex:


• sex
• 1 2
• 8159 4825
Datavisualization
• Creating a ggplot : Coordinate systems

• Syntax of the coord_polar()

• Syntax :
• coord_polar(theta, start, direction, clip)
Datavisualization
• Creating a ggplot : Coordinate systems

• Syntax :
• coord_polar(theta, start, direction, clip)
• theta : variable to map angle to (x or y)
• start : offset of starting point from 12 o'clock in radians
• direction : 1, clockwise; -1, anticlockwise
• clip : should drawing be clipped to the extent of the plot panel? A
setting of "on" (the default) means yes, and a setting of "off"
means no.

• By default, we have coord_polar(theta = "x", start = 0, direction =


1, clip = "on")
• This corresponds to coord_polar()
Datavisualization
• Creating a ggplot : Coordinate systems

• Syntax :
• coord_polar(theta, start, direction, clip)
• For instance if we run:

• ggplot(data=mydata, aes(x=cscor, fill=cscor))


+ geom_bar(width =
1)+coord_polar(theta="y")+labs(x=NULL)
Datavisualization
• Creating a ggplot : Coordinate systems

• coord_polar()
• The polar coordinate system is most commonly
used for pie charts, which are a stacked bar chart
in polar coordinates.
• To this purpose, we will use “factor”
• Example :
• ggplot(data=mydata, aes(x=factor(1),
fill=factor(cscor))) + geom_bar(width =
1)+coord_polar(theta="y")+labs(x=NULL)
Datavisualization
• Creating a ggplot : Coordinate systems

• coord_polar()
• The polar coordinate system can be used for a
bullseye chart.
• To this purpose, we will use also “factor”
• Example :
• ggplot(data=mydata, aes(x=factor(1),
fill=factor(cscor))) + geom_bar(width =
1)+coord_polar()+labs(x=NULL)
Clustering with K-means
CLASS 4
Datavisualization
• Clustering data with K-means

• Cluster analysis is part of the so-called


unsupervised learning.
• A cluster is a group of data that share similar
features.
• K-means is one of the most popular
clustering methods.
Datavisualization
• Clustering data with K-means

• The K-means algorithm works in four steps:

• Step 1: Choose K group centers in the feature space randomly.
• Step 2: Assign each observation to the nearest cluster
center (centroid), i.e. minimize the distance between the
cluster centers and the observations.
• Step 3: Shift each initial centroid to the mean of the
coordinates of the observations within its group.
• Step 4: Minimize the distances according to the new
centroids. New boundaries are created. Thus, observations
may move from one group to another.
• [Repeat until no observation changes groups]
K-means algorithm
Datavisualization
• Clustering data with K-means

• K-means usually takes the Euclidian distance :

  d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}

• where x and y are n-dimension vectors
Datavisualization
• Clustering data with K-means

• K-means may also take the Manhattan distance :

• The Manhattan distance is the distance
associated with the L1-norm.
• Recall that if x = (x_1, …, x_n) then the L1-norm
writes : \|x\|_1 = \sum_{i=1}^{n} |x_i|
• The Manhattan distance is

  d(x, y) = \sum_{i=1}^{n} |x_i - y_i|

• where x and y are n-dimension vectors
Datavisualization
• Clustering data with K-means

• K-means may also take the Minkowski distance :

• The Minkowski distance is a generalization of the
Manhattan distance and the Euclidian distance.
• It is the distance associated with the Lp-norm.
• Recall that if x = (x_1, …, x_n) then the Lp-norm
writes : \|x\|_p = \left(\sum_{i=1}^{n} |x_i|^p\right)^{1/p}
• The Minkowski distance is

  d(x, y) = \left(\sum_{i=1}^{n} |x_i - y_i|^p\right)^{1/p}

• where x and y are n-dimension vectors
Datavisualization
• Clustering data with K-means

• For more information about Minkowski


distance:

• https://www.sciencedirect.com/topics/comp
uter-science/minkowski-distance
Datavisualization
• Clustering data with K-means
• Let n be the number of observations from the dataset.
• Let p be the number of characteristics (attributes): j=1
to p.
• X_j = (x_{1j}, …, x_{nj}) is the vector of observations concerning
characteristic « j »
• x_{ij} is the observation concerning characteristic « j »
and individual « i »
• X = (X_1, …, X_p) is the (n,p) matrix of the vectors X_j
Datavisualization

• Clustering data with K-means

• Example : Beauty contest


               Charact 1   Charact 2   Charact 3
  Candidate 1
  Candidate 2
  Candidate 3
  Candidate 4
  Candidate 5
  (the cell values appeared as a figure in the original slides)
Datavisualization

• Clustering data with K-means

• Vector of observations for characteristic « 1 » : X_1
• Vector of observations for characteristic « 2 » : X_2
• Vector of observations for characteristic « 3 » : X_3
• (the numerical vectors appeared as a figure in the original slides)
Datavisualization

• Clustering data with K-means

• We have 5 candidates and we want to know if


using the 3 characteristics, we can group
them.
• For instance, like this :
• Cluster 1 : candidates 1, 5 and 4
• Cluster 2 : Candidates 2 and 3
Datavisualization

• Clustering data with K-means

• This means that :


• candidates 1, 5 and 4 are close with respect
to characteristics 1, 2 and 3;
• while candidates 3 and 2 are close with
respect to characteristics 1, 2 and 3.
Datavisualization
• Clustering data with K-means
• Let K (1 ≤ K ≤ n) be the number of clusters that we
want to have.
• The number K is set arbitrarily by the user.
• Once the user has set the number of clusters that
he/she wants, the K-means algorithm will identify the
K centroids (in French: centres de gravité) \mu_k, k=1 to K,
of the clusters.
• The clusters minimize the distance between the
observations assigned to a cluster and the associated
centroids.
Datavisualization
• Clustering data with K-means
• Namely:

• Minimization of Σ over x of ||x − μc(x)||²

• where c(x) is the assigned cluster of data x

• x is a data of dimension p (x is a vector of observations concerning
an individual; in the previous example, there are five x)

• Remark : ||x − μk||² = Square of the Euclidean
distance between x and μk
Datavisualization
• Clustering data with K-means
• How does the algo work ?

• We enter X and K
• Let μk(0), k=1 to K, be the initial values of the centroids
• t=1
• Do until the STOP CRITERIA is satisfied :
• Assign each observation x to a cluster :
c(t)(x) = argmin over k of ||x − μk(t−1)||²
• Let Ck(t) be the set of observations assigned to the cluster k :
Ck(t) = { x : c(t)(x) = k }
Datavisualization

• Clustering data with K-means

• How does the algo work ?

• Update the centroids of the K clusters :

• μk(t) = (1/|Ck(t)|) · Σ of x over x ∈ Ck(t)

• t=t+1
Datavisualization

• Clustering data with K-means

• How does the algo work ?

• Remind that the STOP CRITERIA is the


following :
• NO OBSERVATION MOVES FROM ONE
CLUSTER TO ANOTHER
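The steps above can be sketched in R on a toy one-dimensional dataset; my_kmeans is a hypothetical name, distances are squared Euclidean, and an emptied cluster is not handled in this sketch:

```r
# Minimal from-scratch K-means following the steps above
# (stops when no observation moves from one cluster to another)
my_kmeans <- function(X, mu) {
  X <- as.matrix(X); mu <- as.matrix(mu)
  cl_old <- rep(0L, nrow(X))
  repeat {
    # assignment step: nearest centroid in squared Euclidean distance
    d2 <- sapply(seq_len(nrow(mu)), function(k) colSums((t(X) - mu[k, ])^2))
    cl <- apply(d2, 1, which.min)
    # STOP CRITERIA: no observation moves from one cluster to another
    if (all(cl == cl_old)) break
    cl_old <- cl
    # update step: centroid = mean of the observations assigned to the cluster
    for (k in seq_len(nrow(mu)))
      mu[k, ] <- colMeans(X[cl == k, , drop = FALSE])
  }
  list(cluster = cl, centers = mu)
}

# toy dataset with initial centroids 1 and 20
res <- my_kmeans(matrix(c(1, 2, 9, 12, 20)), matrix(c(1, 20)))
res$cluster  # 1 1 1 2 2
res$centers  # 4 and 16
```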
Datavisualization

• Clustering data with K-means

• Example 1
• Let us consider a dataset with 5 individuals and 1
characteristic :

Individual   Characteristic 1
1            1
2            2
3            9
4            12
5            20
Datavisualization

• Clustering data with K-means

• Question:
• Apply K-means algorithm with
• K=2
• and μ1(0) = 1 , μ2(0) = 20
Datavisualization
• Clustering data with K-means
• Answer:
• t=1 (example computation : |x − μ1|² = |1 − 1|² = 0)

x           1    2    9    12   20
(x − μ1)²   0    1    64   121  361
(x − μ2)²   361  324  121  64   0
cluster     1    1    1    2    2

• Clusters : C1(1) = {1, 2, 9} and C2(1) = {12, 20}
Datavisualization
• Clustering data with K-means

• UPDATE : μ1 = (1+2+9)/3 = 4 ; μ2 = (12+20)/2 = 16


Datavisualization
• Clustering data with K-means
• Answer:
• t=2 (example computation : |x − μ1|² = |1 − 4|² = 9)

x           1    2    9    12   20
(x − μ1)²   9    4    25   64   256
(x − μ2)²   225  196  49   16   16
cluster     1    1    1    2    2

• Clusters : C1(2) = {1, 2, 9} and C2(2) = {12, 20}

Datavisualization
• Clustering data with K-means

• The STOP CRITERIA is reached :


• Because no observation (from t=1 to t=2) has moved
from one cluster to another.

• Then we stop.

• We get the two clusters below.


• Cluster 1 includes individuals 1, 2 and 3;
• Cluster 2 includes individuals 4 and 5
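We can check Example 1 with R's built-in kmeans, passing the two initial centroids explicitly:

```r
# Example 1: 5 observations, K = 2, initial centroids mu1 = 1, mu2 = 20
x  <- matrix(c(1, 2, 9, 12, 20))
km <- kmeans(x, centers = matrix(c(1, 20)))
km$cluster   # 1 1 1 2 2 : individuals 1, 2, 3 vs individuals 4, 5
km$centers   # final centroids : 4 and 16
```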
Datavisualization
• Clustering data with K-means

• Practice 14 : Dataset (same as in Example 1 : 5


individuals and 1 characteristic)

• 1 2 9 12 20

• Suppose that we want 3 clusters : K=3




Datavisualization
• Clustering data with K-means

• REMARK : Normalization/Standardization
• As we have explained above, Kmeans uses a
distance function in order to determine the
various clusters. As a consequence, the units of
the variables will be of a high importance.
• For instance, suppose that there are two
variables : Age (in years) and height (in cm).
• Suppose that the Age variable ranges from 18 to
50; while the height variable ranges from 130 to
210.
Datavisualization
• Clustering data with K-means

• REMARK : Normalization/Standardization

• If we use the classical Euclidean distance, then
the height will have disproportionately more
importance in its computation w.r.t. age.
Datavisualization
• Clustering data with K-means

• REMARK : Normalization/Standardization

• A solution is to pre-process the data before


running a K-means. For instance, by:
i. Transforming the data using a z-score.
(Standardization).
ii. Rescaling the data to have values between 0 and
1. (Feature scaling : Normalization).
Datavisualization
• Clustering data with K-means

• REMARK : Normalization/Standardization
• The terms normalization and standardization
are sometimes used interchangeably, but
they actually refer to different things.
• Normalization means to scale a variable to
have values between 0 and 1.
• Standardization transforms data to have a
mean of zero and a standard deviation of 1.
Datavisualization
• Clustering data with K-means

• REMARK : Normalization/Standardization
• Sometimes what I called “normalization” is
called “standardization” in some textbooks
and vice-versa.
• What is important is that the two terms
mean different things.
Datavisualization
• Clustering data with K-means

• REMARK : Normalization/Standardization
• Standardization with a z-score : the value zi
concerning characteristic “z” for individual “i”
is transformed by : (zi − z̄)/σz , where z̄ and σz
are respectively the mean value and the
standard-deviation of z.
Datavisualization
• Clustering data with K-means

• REMARK : Normalization/Standardization
• Normalization using the min and max values :
the value zi concerning characteristic “z” for
individual “i” is transformed by :
(zi − min(z))/(max(z) − min(z)), where min(z) and
max(z) are respectively the minimal and the
maximal values of z.
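The two transforms can be checked on a toy vector (a minimal sketch, not from the original slides):

```r
z <- c(18, 25, 40, 50)                  # e.g. ages
# standardization (z-score): mean 0, standard deviation 1
z_std  <- (z - mean(z)) / sd(z)
# normalization (min-max): values in [0, 1]
z_norm <- (z - min(z)) / (max(z) - min(z))
round(z_std, 2)
z_norm                                  # 0.00000 0.21875 0.68750 1.00000
```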
Datavisualization
• Clustering data with K-means

• REMARK : Normalization/Standardization
• Standardization can be obtained using the so-
called procedure “scale”.
• Suppose that mydata is the data set that you
want to use. Then we can standardize the
variables in this dataset using the “scale”
procedure:
• mydata_2=scale(mydata)
Datavisualization
• Clustering data with K-means

• REMARK : Normalization/Standardization
• Or we can do it by ourselves directly. Suppose
that there are two variables (myVar1, myVar2) in
mydata. Then we can create their standardized
version:
• mydata$zVar1 <- (mydata$myVar1 -
mean(mydata$myVar1))/sd(mydata$myVar1)
• mydata$zVar2 <- (mydata$myVar2 -
mean(mydata$myVar2))/sd(mydata$myVar2)
Datavisualization
• Clustering data with K-means

• REMARK : Normalization/Standardization
• Likewise, assuming that there are two variables
(myVar1, myVar2) in mydata; we can create their
normalized version:
• mydata$sVar1 <- (mydata$myVar1 -
min(mydata$myVar1))/(max(mydata$myVar1)-
min(mydata$myVar1))
• mydata$sVar2 <- (mydata$myVar2 -
min(mydata$myVar2))/(max(mydata$myVar2)-
min(mydata$myVar2))
Datavisualization
• Clustering data with K-means

• REMARK : Normalization/Standardization
• We can also standardize (or normalize) the
data using the data.Normalization function in
the clusterSim package.

• install.packages("clusterSim")
• library(clusterSim)
Datavisualization
• Clustering data with K-means

• REMARK : Normalization/Standardization

• Syntax :
• data.Normalization(x, type="…", normalization
="column")
Datavisualization
• Clustering data with K-means

• ARGUMENT :

• x : vector, matrix or dataset type


• type:
• n0 : original data
• n1 : (x-mean)/sd
• n2 : (x-median)/mad
• n3 : (x-mean)/range
Datavisualization
• Clustering data with K-means

• ARGUMENT :

• type:
• n3a : (x-median)/range
• n4 : (x-min)/range
• n5 : (x-mean)/max(abs(x-mean))
Datavisualization
• Clustering data with K-means

• ARGUMENT :

• type:
• n5a : (x-median)/max(abs(x-median))
• n6 : x/sd
• n6a : x/mad
• n7 : x/range
• n8 : x/max
Datavisualization
• Clustering data with K-means

• ARGUMENT :

• type:
• n9 : x/mean
• n9a : x/median
• n10 : x/sum
• n11 : x/sqrt(SSQ)
Datavisualization
• Clustering data with K-means

• ARGUMENT :

• type:
• n12 : (x-mean)/sqrt(sum((x-mean)^2))
• n12a : (x-median)/sqrt(sum((x-median)^2))
• n13 : (x-midrange)/(range/2)
Datavisualization
• Clustering data with K-means

• ARGUMENT :
• "column" - normalization by variable, "row" -
normalization by object

• Remind that “mad” means Median Absolute
Deviation = median(|z – median(z)|).
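The MAD used by the n2/n6a types can be checked by hand; note that base R's mad() multiplies by a constant 1.4826 unless constant = 1 is given:

```r
z <- c(1, 2, 4, 8, 100)
median(abs(z - median(z)))   # by hand : median of |z - 4| = 3
mad(z, constant = 1)         # same value via base R's mad()
mad(z)                       # default scales by 1.4826 (consistency with sd under normality)
```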
Datavisualization
• Clustering data with K-means

• The Median Absolute Deviation is a robust


measure of how spread out a set of data is.

• The variance and standard deviation are also


measures of spread, but they are more
affected by extremely high or extremely low
values and non-normality.
Datavisualization
• Clustering data with K-means

• REMARK : Normalization/Standardization
• Therefore using the “n4” type will transform the
original variables in the dataset into variables
taking values in the interval [0,1].

• Syntax :
• mydata_3 <-
data.Normalization(mydata,type="n4",normaliza
tion="column")
Datavisualization
• Clustering data with K-means

• Practice 15. Does there exist a way to define the best
number of clusters ?

• Answer : NO
• However there exist some methods borrowed from
parametric statistics.
• For instance : C(K) = W(K) + P(K), where
• W(K) = Σk Σ over x ∈ Ck of ||x − μk||² is the
measurement of the intra-inertia of the clustering of size
K
• P(K) (for instance 2·K·n·log(n)) is a kind of BIC (Bayesian
Information Criteria) penalty
Datavisualization
• Clustering data with K-means

• Using this criterion :

• For K=2 : compute C(2)

• For K=3 : compute C(3)

• If C(3) < C(2), then K=3 seems to be a better number of
clusters comparing with K=2.
Datavisualization
• Clustering data with K-means

• More generally, it is possible to compute this
quantity for K = 2, 3, …, n and to take the K
which corresponds to the smallest C(K).
Datavisualization
• Clustering data with K-means

• Remark :
• The initializing values μk(0) of the centroids
are taken from the dataset
Datavisualization
• Clustering data with K-means

• Drawback of the KMEANS Method :


• It is sensitive to the number of clusters that
we have specified.
• It is sensitive to the initial values of the
centroids μk(0).
Datavisualization
• Clustering data with K-means

• Let us now take an example in order to


illustrate how k-means works.
• We need here only the package “dplyr”.
• Since it is already included in the package
“tidyverse”, then we do not need here to load
this package.
• We can use also the package “stats”.
Datavisualization
• Clustering data with K-means

• The syntax is:


• kmeans(nameofdata, k)
• where k is the number of clusters that we want.

• #If we want to have information about kmeans


• ?kmeans
Datavisualization
• Clustering data with K-means

• We can be more precise concerning the


syntax:
• kmeans(nameofdata, centers=k)
Datavisualization
• Clustering data with K-means

• We can be more precise concerning the syntax:


• kmeans(nameofdata,centers=k, nstart=d)

• where nstart states the number of trials with different


initial values of centroids (indeed the kmeans algo
randomly selects the initial values of centroids, and these
values have an impact on the contents of the clusters).

• The default value for this parameter is 1.


• For instance, adding nstart=10 will generate 10 initial
random centroids and choose the best one (in terms of
Total Within SS) for the algorithm.
Datavisualization
• Clustering data with K-means

• Remark:
• When using a “large” dataset, it is likely that
nstart will play a very minor role.
• Indeed, how many possibilities do we have to
choose k initial values for centroids over a set
of n individuals? For instance : n=12984, k=3.
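For the numbers above, the count of possible sets of initial centroids is C(n, k), which R computes directly:

```r
# number of ways to pick k = 3 distinct initial centroids among n = 12984 individuals
choose(12984, 3)   # about 3.6e11, so nstart = 10 explores a tiny fraction
```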
Datavisualization
• Clustering data with K-means

• We can be more precise concerning the syntax:


• kmeans(nameofdata, centers=k, nstart=d, iter.max=v,
method="euclidean")

• where iter.max is the maximum number of iterations
allowed; by default it is equal to 10;
• where method specifies the distance measure to be
used; this must be one of "euclidean", "maximum",
"manhattan", "canberra", "binary", "pearson",
"abspearson", "abscorrelation", "correlation",
"spearman" or "kendall"; by default it is the Euclidean
distance.
• (Note: the method argument is provided by the Kmeans
function of the "amap" package; base R's kmeans does
not take a method argument.)
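If a non-Euclidean distance is needed, a hedged sketch with Kmeans (capital K) from the amap package, which is where this method argument exists; mydata_b is the two-column dataset of the running example:

```r
# distance choice is available in amap::Kmeans, not in base stats::kmeans
# install.packages("amap")
library(amap)
res <- Kmeans(mydata_b, centers = 3, iter.max = 20, method = "manhattan")
res$centers
```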
Datavisualization
• Clustering data with K-means

• Example :
• #We select two variables
• mydata_b <- mydata %>%
• select(c(salnet, age))

• For the sake of simplicity, we will work


directly with the original variables.
Datavisualization
• Clustering data with K-means

• Example :
• If we want to work with normalized or standardized
versions of the dataset, then we should first pre-process
the data.

• #Normalization
• mydata_b1=scale(mydata_b)

• #Standardization
• mydata_b2=data.Normalization(mydata_b,type="n4",nor
malization="column")
Datavisualization
• Clustering data with K-means

• Let us continue our Example :

• #We select two variables


• mydata_b <- mydata %>%
• select(c(salnet, age))

• #kmeans with 3 clusters


• kmeans(mydata_b,3)
Datavisualization
• Clustering data with K-means

• Example :
• #We can also save the result of the kmeans
into an object
• sangoku<-kmeans(mydata_b,3)

• #We can visualize the result of the kmeans


• print(sangoku)
Datavisualization
• Clustering data with K-means

K-means clustering with 3 clusters of sizes
3142, 633, 9209

Cluster means:
      salnet      age
1   32847.52 43.70687
2   80876.33 47.68246
3   17010.33 39.68400

• Row « j » of “Cluster means” is the centroid of cluster « j », j=1,2,3.
• The algo has converged in 2 iterations.

Clustering vector:
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
 3  1  3  3  3  1  2  1  2  1  1  3  3  2  2  1  3  1  1
20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
 3  2  2  2  1  3  1  1  2  1  2  3  2  1  3  2  1  3  1
39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57
 2  3  2  3  3  1  1  2  2  1  2  3  2  1  2  3  3  1  2
58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76
 1  3  1  1  3  1  3  3  1  1  2  1  2  1  3  3  2  2  1
77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
 1  3  3  2  3  2  1  1  1  2  1  1  1  1  1  1  1  1  3

• For instance, position 19 of the clustering vector equals 1 :
individual « 19 » belongs to Cluster 1.
Datavisualization
• Clustering data with K-means

Within cluster sum of squares by cluster:


[1] 150301605982 226433357446 162589612409
(between_SS / total_SS = 83.6 %)

Available components:

[1] "cluster" "centers" "totss" "withinss"


"tot.withinss" "betweenss"
[7] "size" "iter" "ifault"
Datavisualization
• Clustering data with K-means

• “cluster”: A vector indicating the cluster to which each point is


allocated.
• “centers”: A matrix of cluster centers.
• “totss”: The total sum of squares (over the whole dataset).
• “withinss”: Vector of within cluster sum of squares, one
component per cluster (called also wss).
• “tot.withinss”: Total within cluster sum of squares, it is equal to
sum(withinss).
• “betweenss”: The between cluster sum of squares, it is equal to
totss - tot.withinss.
• “size”: The number of points in each cluster.
• "ifault“ : Indicator of a possible algorithm problem.
Datavisualization
• Clustering data with K-means

• “withinss” for the cluster k is the sum of
squares within this cluster :

• withinss(k) = Σ over x ∈ Ck of ||x − μk||²

• where Ck is the cluster k.
Datavisualization
• Clustering data with K-means

• “tot.withinss”: Total within cluster sum of


squares, it is equal to sum(withinss).

• The total within cluster sum of square measures


the compactness (i.e goodness) of the clustering
and we want it to be as small as possible.
Datavisualization
• Clustering data with K-means

• “totss”, the total sum of squares (over the
whole dataset) writes :
totss = Σi Σj (xij − x̄j)²,
where j is a characteristic and x̄j is the mean of
characteristic « j ».

• “betweenss”, the between cluster sum of


squares, writes : totss - tot.withinss. We want
it to be as high as possible.
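These bookkeeping identities can be verified on any kmeans fit; here a self-contained check on the small Example 1 data:

```r
km <- kmeans(matrix(c(1, 2, 9, 12, 20)), centers = 2)
# tot.withinss = sum(withinss)
isTRUE(all.equal(sum(km$withinss), km$tot.withinss))
# totss = betweenss + tot.withinss
isTRUE(all.equal(km$totss, km$betweenss + km$tot.withinss))
```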
Datavisualization
• Clustering data with K-means

• Using Rstudio, we can directly visualize the


contents of “sangoku” (which includes the
output of our kmeans regression).
Datavisualization
• Clustering data with K-means

• We can also print the content of each component of


“sangoku” if we want.

• For instance:
• print(sangoku$centers)

• We get:
salnet age
1 32847.52 43.70687
2 80876.33 47.68246
3 17010.33 39.68400
Datavisualization
• Clustering data with K-means

• For instance:
• print(sangoku$withinss)

• We get:
• 150301605982 226433357446 162589612409
Datavisualization
• Clustering data with K-means

• Visualize the clusters:

• ggplot(mydata_b)+geom_point(aes(salnet,ag
e), col=sangoku$cluster)
Datavisualization
• Clustering data with K-means

• Visualize the clusters:

• ggplot(mydata_b)+geom_point(aes(salnet,ag
e), col=sangoku$cluster) +
facet_wrap(~sangoku$cluster, nrow = 2)
Datavisualization
• Clustering data with K-means

• Visualize the clusters:

• Remark : If we use a labelled dataset, then


we will see the name of the individuals.
Datavisualization
• Clustering data with K-means

• Optimal number of clusters

• We have already talked about the optimal


number of clusters.

• Optimal w.r.t what ?


Datavisualization
• Clustering data with K-means

• w.r.t. “tot.withinss”,

• Looking for K that minimizes tot.withinss. Such a


K exists and it is K=n.

• Indeed, tot.withinss = 0 if there is only one


element inside each cluster.

• Of course, this case is degenerate.


Datavisualization
• Clustering data with K-means

• w.r.t. “tot.withinss”,

• So we are actually looking for the K such that the
marginal decrease of tot.withinss “tends” to zero.
• In other words, one should choose a number of
clusters so that adding another cluster doesn’t
improve the total WSS much.
• This method is called the Elbow method.
[Figure : elbow plot — it seems that K=5 is the optimal
number of clusters]
Datavisualization
• Clustering data with K-means

• w.r.t. “tot.withinss” + penalization,

• We can take for instance P = 2·K·n·log(n)

• Looking for a K that minimizes tot.withinss + P.
• The interpretation of P : Measurement of the
complexity of the K-clustering.
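A sketch of this penalized choice of K on synthetic data (the running dataset is replaced by two well-separated simulated groups so the example is self-contained):

```r
set.seed(1)
# two synthetic clusters of 50 points each in 2 dimensions
X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 5), ncol = 2))
n <- nrow(X)
# C(K) = tot.withinss + 2*K*n*log(n), computed for K = 2, ..., 10
crit <- sapply(2:10, function(k)
  kmeans(X, centers = k, nstart = 10)$tot.withinss + 2 * k * n * log(n))
best_K <- (2:10)[which.min(crit)]
best_K   # expected: 2, matching the two simulated groups
```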
Datavisualization
• Clustering data with K-means

• w.r.t. “betweenss”

• Looking for a K that maximizes “betweenss”


• Here again, the max is trivially obtained when
there is only one element inside each cluster.
• Hence, we actually choose a number of clusters
so that adding another cluster doesn’t improve
the betweenss much.

Datavisualization
• Clustering data with K-means

• w.r.t. “betweenss/totss”

• Looking for a K that maximizes “betweenss/totss” (the


explained intra-inertia).

• Here again, the max is trivially obtained when there is only


one element inside each cluster.
• Hence, we actually choose a number of clusters so that
adding another cluster doesn’t improve the
proportion betweenss/totss much.

Datavisualization
• Clustering data with K-means

• w.r.t. several other criteria

• Calinski-Harabasz method
• Silhouette method
• (…)
Datavisualization
• Clustering data with K-means

• “tot.withinss”

• #Compute the wss


• elb_wss <- rep(0,times=10)
• for (k in 1:10){
• clus <- kmeans(mydata_b,centers=k)
• elb_wss[k] <- clus$tot.withinss
• }
• #Plot the graph of wss w.r.t k
• plot(1:10,elb_wss,type="b",xlab="Nb. of
clusters",ylab="WSS")
[Plot : WSS against the number of clusters — 3 or 4 seems
to be the optimal number of clusters]
Datavisualization
• Clustering data with K-means

• “betweenss/totss”

• #Compute the proportion of explained intra-inertia


• inertia.expl <- rep(0,times=10)
• for (k in 1:10){
• clus <- kmeans(mydata_b,centers=k)
• inertia.expl[k] <- clus$betweenss/clus$totss
• }
• #Plot the graph of explained inertia w.r.t k
• plot(1:10,inertia.expl,type="b",xlab="Nb. of
clusters",ylab="% expl inertia")
[Plot : explained inertia against the number of clusters —
3 or 4 seems to be the optimal number of clusters]
Datavisualization
• Clustering data with K-means

• Now we want to use another function called


fviz_cluster

• We need to install some packages:


• install.packages(c("factoextra", "fpc",
"NbClust"))
Datavisualization
• Clustering data with K-means

• Then we load :
• library(factoextra)
• library(fpc)
• library(NbClust)
Datavisualization
• Clustering data with K-means

• We will use a subset of mydata_b, because it seems
that fviz_nbclust does not support large datasets.

• This subset mydata_c will include 200 individuals


• Let us select them randomly.

• library(data.table)
• mydata_c <- data.table(mydata_b)
• mydata_c<-mydata_c[sample(.N, 200)]
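An equivalent draw in base R, with a seed so the subsample is reproducible (a sketch; mydata_b is assumed to be the two-column dataset of the running example):

```r
set.seed(123)                                # make the subsample reproducible
mydata_c <- mydata_b[sample(nrow(mydata_b), 200), ]
nrow(mydata_c)  # 200
```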
Datavisualization
• Clustering data with K-means

• sangohan=kmeans(mydata_c,3)

• If we print sangohan
• print(sangohan)

• We get:
Datavisualization
• Clustering data with K-means

• K-means clustering with 3 clusters of sizes 15, 125, 60

• Cluster means:
• salnet age
• 1 60665.20 46.20000
• 2 16890.02 38.84000
• 3 29800.07 43.28333

• Clustering vector:
• [1] 2 2 2 1 2 2 3 2 2 2 3 2 1 2 3 3 2 2 3 2 2 2 3 1 2 2 3 3 2 2 2 3 2 2 2 2 2 2 2 2
• [41] 3 3 2 3 2 2 2 2 2 2 2 2 2 2 3 2 2 2 3 3 2 1 3 3 3 3 2 2 3 3 2 2 2 2 2 2 3 3 2 2
• [81] 2 3 2 3 3 3 2 2 3 2 2 2 2 3 2 2 2 2 3 2 2 3 2 2 3 2 3 3 3 2 2 3 1 3 2 2 2 2 2 2
• [121] 2 2 2 2 2 1 1 3 1 2 3 3 3 2 2 3 2 2 2 3 1 3 1 2 3 2 2 2 2 1 2 2 2 2 2 2 2 3 3 3
• [161] 2 2 2 1 2 2 2 3 2 3 1 2 2 3 2 2 1 3 2 2 3 3 3 2 2 2 2 2 3 2 2 2 3 1 2 2 3 3 2 3

• Within cluster sum of squares by cluster:


• [1] 4206924773 1695780269 1672432680
• (between_SS / total_SS = 79.0 %)

• Available components:

• [1] "cluster" "centers" "totss" "withinss" "tot.withinss"


• [6] "betweenss" "size" "iter" "ifault"
Datavisualization
• Clustering data with K-means

• Let us represent the clusters, using


fviz_cluster

• fviz_cluster(sangohan, geom = "point", data =
mydata_c) + ggtitle("k=3")
Datavisualization
• Clustering data with K-means

• We remark that the scales of the x-axis and


the y-axis of the graph are different.

• Explanation : Display the clusters on the two


main components (using PCA).
Datavisualization
• Clustering data with K-means

• Let us represent the clusters, using fviz_cluster,


with “text” and original values for the axes.

• fviz_cluster(sangohan, geom = "text", data =
mydata_c, stand=FALSE) + ggtitle("k=3")

• stand: logical value; if TRUE, data is standardized


before principal component analysis
Datavisualization
• Clustering data with K-means

• Actually with fviz_cluster, we can use another


function (instead of function KMEANS) in order
to construct the clusters.
• This function is called eclust
• However since eclust can compute also PCA and
HCA then we need to clearly specify that we
want a KMEANS.

• km.res <- eclust(mydata_c, "kmeans", k = 3)


Datavisualization
• Clustering data with K-means

• Syntax:
• fviz_cluster(object, data = NULL, stand = TRUE, geom =
c("point", "text"), frame = TRUE, frame.type =
"convex")

• object: an object of class “partition” created by the


functions pam(), clara() or fanny() in cluster package. It
can be also an output of kmeans() function in stats
package. In this case the argument data is required.
• data: the data that has been used for clustering.
Required only when object is a class of kmeans.
Datavisualization
• Clustering data with K-means

• stand: logical value; if TRUE, data is standardized before principal


component analysis
• geom: a text specifying the geometry to be used for the graph.
Allowed values are the combination of c(“point”, “text”). Use
“point” (to show only points); “text” to show only labels; c(“point”,
“text”) to show both types.
• frame: logical value; if TRUE, draws outline around points of each
cluster
• frame.type: Character specifying frame type. Possible values are
‘convex’ or types supported by ggplot2::stat_ellipse including one
of c(“t”, “norm”, “euclid”).
• We can also use any option of ggplot inside it.
Datavisualization
• Clustering data : Beyond K-means

• Remark: Use PAM, CLARA, FANNY

• PAM = Partitioning Around Medoids : The use of means implies


that k-means clustering is highly sensitive to outliers. This can
severely affect the assignment of observations to clusters. A more
robust algorithm is provided by PAM algorithm (Partitioning
Around Medoids) which is also known as k-medoids clustering.
• CLARA = Clustering LARge Applications : is a partitioning method
used to deal with much larger data sets in order to reduce
computing time and RAM storage problem.
• FANNY = Fuzzy Analysis Clustering : computes a fuzzy clustering of
the data into k clusters.
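A hedged sketch of the outlier-robust alternative with pam() from the cluster package (a recommended package shipped with R), on self-contained synthetic data:

```r
library(cluster)   # pam, clara and fanny all live here
set.seed(42)
# two well-separated synthetic groups of 30 points each
X <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
           matrix(rnorm(60, mean = 6), ncol = 2))
pam.res <- pam(X, k = 2)        # k-medoids: each center is an actual observation
pam.res$medoids                 # each medoid is a row of X
table(pam.res$clustering)       # cluster sizes
```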
Datavisualization
• Clustering data with K-means

• Visualize the clusters.

• fviz_cluster(km.res, geom = "text",


stand=FALSE)
Datavisualization
• Clustering data with K-means

• Visualize the clusters.

• fviz_cluster(km.res, geom = "point", palette =


"jco", stand=FALSE, ggtheme = theme_minimal())

• #jco is a palette of colors set by the journal of


clinical oncology
Datavisualization
• Clustering data with K-means

• Visualize the clusters.

• fviz_cluster(km.res, geom = c("point"),


stand=FALSE) + scale_color_brewer('Cluster',
palette='Set2') + scale_fill_brewer('Cluster',
palette='Set2')
Datavisualization
• Clustering data with K-means

• Optimal number of clusters using fviz_nbclust.

• Syntax:
• fviz_nbclust(x, FUNcluster, method)

• x: numeric matrix or data frame


• FUNcluster: a partitioning function such as kmeans,
pam, clara,…
• method: the method (wss, silhouette) to be used for
determining the optimal number of clusters.
Datavisualization
• Clustering data with K-means

• Optimal number of clusters using


fviz_nbclust.

• fviz_nbclust(mydata_c, kmeans, method =


"wss")
Datavisualization
• Clustering data with K-means

• Remark :
• In order to graphically illustrate how the
observations move from one cluster to another, let us
use the package “animation”.

• install.packages("animation")
• library("animation")
• kmeans.ani(mydata_c,3)
Datavisualization
• Clustering data with K-means

• Practice 15 :
1. Select randomly a subsample of 250 individuals from COI2006
2. Use the following variables : age, salnet, effl_corr, stress
(concerning this latter variable, first redefine a numerical
variable from it)
3. First analyze the distribution of your dataset with regards to
age, salnet, effl_corr, stress
4. Clustering with K=4 (with 15 initialization trials)
5. Merge your initial dataset with the number of cluster variable
6. Display the four clusters
7. What could be the optimal number of clusters on your dataset
?
Project 2019
• Analyze the dataset for 2016, that
includes the below variables, for some
countries (take as much countries as
you can).
• Your underlying analysis should be :
does there exist a link between wealth
and the so-called ESG criteria?
Variables

Control corruption CONTROL

Gov effectiveness GOVEFFEC

political stability POLITICAL

regulatory REGULATORY

rule of law RULEOFLAW

voice VOICE
Variables
Combustible renewables and waste (% of total
energy) RENEENERGY

Electricity production from renewable sources,


excluding hydroelectric (% of total) ELECRENEWABLE
Forest area (% of land area) BIODIVERSITY
Fossil fuel energy consumption (% of total) NONFOSSIL

Renewable electricity output (% of total


electricity output) ELECTRENEWABLEPRODUCT

Renewable energy consumption (% of total


final energy consumption) CONSUMPRENEWABLEENERGY
Protected area PROTECTED
Variables

Employment to population ratio, 15+, total (%) (national


estimate) EMPLOY
Health expenditure, public (% of total health expenditure) HEALTH
Life expectancy at birth, total (years) LIFEEXPEC

Ratio of female to male labor force participation rate (%)


(national estimate) MALETOFEMAL
School enrollment, secondary (% gross) ENROLLSEC
Vulnerable employment, total (% of total employment) VUL

Labor force participation rate for ages 15-24, total (%)


(national estimate) Labor
Variables
SPENDING ON EDUCATION educspen

Fertility rate, total (births per woman) Fertility

Birth rate, crude (per 1,000 people) birth

School enrollment, primary (gross), gender parity index


(GPI) GPIPRIM

School enrollment, primary and secondary (gross), gender


parity index (GPI) GPISECPRIM

School enrollment, secondary (gross), gender parity index


(GPI) GPISEC
Variables
GDP per capita GDPPC

No regressions please. Only data analysis + clustering

You can add some other variables if you want
