Lecture 3&4
Marc-Arthur Diaye
Full Professor
University Paris 1 Pantheon-Sorbonne
Data Analytics
Datavisualization
• Concerning Correlation matrix
• Example 2 :
• Suppose that we want to change the color.
• Several methods.
• Method 1:
• #(20) means that we want a vector of length 20
• #Here we take three colors: red, black, and blue
• #The resulting colors are a blend of these three colors
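The comments above describe a palette of 20 colors blended from red, black and blue. A minimal sketch of one way to obtain it, assuming the palette is built with colorRampPalette() from base R (grDevices); m_demo is a hypothetical stand-in for the correlation matrix m_newdata2, and the corrplot call only runs if the package is installed:

```r
# Assumption: the 20-color vector is generated with colorRampPalette()
my_colors <- colorRampPalette(c("red", "black", "blue"))(20)
length(my_colors)

# Hypothetical stand-in for the lecture's correlation matrix m_newdata2
m_demo <- cor(mtcars[, 1:5])

# Draw the correlation plot only if corrplot is available
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(m_demo, col = my_colors)
}
```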
• Method 2:
• #Use the package RColorBrewer
• #Select a specific palette of colors (for example «RdBu») and the number of colors from this palette (for example n=8)
• #If RColorBrewer is not already installed, install it first
• install.packages("RColorBrewer")
• library(RColorBrewer)
• corrplot(m_newdata2, col=brewer.pal(n=8, name="RdBu"))
• Remark:
• display.brewer.all()
• Example 2 :
• Method 2:
• #Let us now go back to the instruction:
• corrplot(m_newdata2, col=brewer.pal(n=8, name="RdBu"))
• Method 3:
• #Use the wesanderson package
• #Select a specific palette of colors (for example «Darjeeling1») and the number of colors from this palette (for example n=5)
• #If wesanderson is not already installed, install it first
• install.packages("wesanderson")
• library(wesanderson)
• corrplot(m_newdata2, col=wes_palette(n=5, name="Darjeeling1"))
• Example 2b :
• Suppose that we want to change the background color (bg).
• And the color of the variables’ names (tl.col)
• corrplot(m_newdata2, col=c("black", "white"), bg="lightblue", tl.col="black")
• Example 3 :
• corrplot(m_newdata2, method="pie")
• Example 3b :
• corrplot(m_newdata2, method="ellipse")
• Example 4 :
• corrplot(m_newdata2, method="color")
• Example 5 :
• corrplot(m_newdata2, method="number")
• Example 6 :
• corrplot(m_newdata2, method="color", type="lower")
• Example 7 :
• corrplot(m_newdata2, method="color", type="upper")
• Example 8 :
• corrplot(m_newdata2, order="AOE", method="color", addCoef.col = "#999999")
• Example 8b :
• corrplot(m_newdata2, order="alphabet")
• #Alphabetical order
• Example 9 :
• corrplot.mixed(m_newdata2, order="AOE")
• Example 10 :
• res<-cor.mtest(m_newdata2, conf.level=.99)
• corrplot(m_newdata2, p.mat=res$p, sig.level=.1)
• Example 10b :
• res<-cor.mtest(m_newdata2, conf.level=.95)
• corrplot(m_newdata2, p.mat=res$p, sig.level=.01)
• Example 10c :
• corrplot(m_newdata2, p.mat=res$p, insig="blank")
• Example 10d :
• corrplot(m_newdata2, p.mat=res$p, insig="p-value")
• Example 10e :
• Example 10f :
• Example 11 :
• ggplot(mydata) +
• aes(x = age, y = salnet) +
• geom_bin2d() +
• xlab("Age") +
• ylab("Net wage") +
• labs(fill = "Frequency")
• Creating a ggplot : geometric objects
• Interpretation :
• Workers with a net wage of 100,000 euros a year are between 28 and 65 years old. Moreover, few workers seem to earn 100,000 euros a year.
• The most numerous workers appear to be those around 25 years old earning around 17,000 euros a year, those around 33 years old earning around 21,500 euros a year, and those between 40 and 43 years old earning around 21,500 euros a year.
• install.packages("hexbin")
• library(hexbin)
• ggplot(mydata) +
• aes(x = age, y = salnet) +
• geom_hex() +
• xlab("Age") +
• ylab("Net wage") +
• labs(fill = "Frequency")
• Practice 11
• Solution to Practice 11
• #bins=10, change the color, change the background
• ggplot(mydata, mapping=aes(age, salnet)) +
• scale_fill_gradient(low = "#00FF00", high = "#FFFF00") + geom_hex(bins = 10) + theme_classic() + xlab("age") + ylab("Net wage")
• ggplot(mydata) +
• aes(x = cscor, y = ag5) +
• geom_bin2d() +
• xlab("cscor") +
• ylab("Age") +
• labs(fill = "Frequency")
• Interpretation :
• Blue-collars are mainly between 40 and 49 years old.
• Workers between 40 and 49 years old are mainly Blue-collars.
• Blue-collars between 40 and 49 years old are the most numerous.
• ggplot(mydata) +
• aes(x = cscor, y = ag5) +
• geom_hex() +
• xlab("cscor") +
• ylab("Age") +
• labs(fill = "Frequency")
• Interpretation :
• The most numerous workers are those between 30 and 39. They are followed by those between 40 and 49, then by those aged 50 and more (2525 + 65 = 2590), and finally by those between 15 and 29.
• ag5
•  15-29   30-39   40-49   50-59   60 and more
•  2384    4059    3951    2525    65
• ggplot(mydata) +
• aes(x = cscor, y = age) +
• geom_hex() +
• xlab("cscor") +
• ylab("Age") +
• labs(fill = "Frequency")
• Creating a ggplot : annotation
• ggplot(mydata) + geom_text(aes(x=ag5, y=salnet, label = v_manager))
• mydata$v_manager <- rep("Non manager", length(mydata$cscor))
• mydata$v_manager[mydata$cscor %in% c(3,4)] <- "Manager"
• table(mydata$v_manager)
• ggplot(mydata) +
• geom_text(aes(x=ag5, y=salnet, label = v_manager, color=sex))
• best_in_age = mydata[c("sex","sexe","salnet","cscor","age","ag5","v_manager")]
[Plot: salnet against ag5, with v_manager labels]
• Interpretation:
• Concerning workers between 15 and 29 years
old, the non manager who earns the highest
net wage is a man. His wage is around 41K€.
The manager who earns the highest net wage
can be a woman or a man. His/her wage is
100K€.
• install.packages("ggrepel")
• library(ggrepel)
• ggplot(mydata, aes(ag5, salnet)) +
• geom_point(aes(colour = sex)) +
• geom_point(size = 3, shape = 1, data = best_in_age2) +
• ggrepel::geom_label_repel(aes(label = v_manager), data = best_in_age2)
Zoom in order to better see the output
• In our code
• best_in_age2<-best_in_age %>%
• group_by(ag5,v_manager) %>%
• summarize(salnet = max(salnet, na.rm = TRUE))
• ggrepel::geom_label_repel
• means : use the “geom_label_repel” function
from the “ggrepel” package.
• Creating a ggplot : geometric objects
• Remark on geom_label
• Of course, like geom_text, geom_label can also be used on its own.
• Practice 12
• Solution to Practice 12
• newdata<-subset(mydata, nafen_g4%in%c("EU","ET","EV"))
• table(newdata$nafen_g4)
•
• best_in_bs = newdata[c("nafen_g4","nafen_g16","ag5", "age", "salnet")]
• best_in_bs2<-best_in_bs %>%
• group_by(ag5,nafen_g4) %>%
• summarize(salnet = max(salnet, na.rm = TRUE))
•
• ggplot(newdata, aes(ag5, salnet)) +
• geom_point(aes(colour = nafen_g16)) +
• geom_point(size = 2, shape = 1, data = best_in_bs2) +
• ggrepel::geom_label_repel(aes(label = nafen_g4), data = best_in_bs2)
• Creating a ggplot : statistical transformation
• Example:
• ggplot(data = mydata) + geom_bar(mapping = aes(x = cscor))
• Creating a ggplot : Position adjustment
• Example:
• #Use of the colour aesthetic
• ggplot(data = mydata) + geom_bar(mapping = aes(x = cscor, colour=cscor))
• Example:
• #Use of the fill aesthetic
• ggplot(data = mydata) + geom_bar(mapping = aes(x = cscor, fill=cscor))
• Example:
• ggplot(data = mydata) + geom_bar(mapping = aes(x = cscor, fill=cscor)) + scale_fill_manual(values=c("blue", "#FF3399", "#FFFF33", "#00FF00"))
• Practice 13
• Solution to Practice 13
• Example:
• ggplot(data = mydata) + geom_bar(mapping = aes(x = cscor, fill=sex))
• Example:
• #Use of the colour aesthetic
• ggplot(data = mydata) + geom_bar(mapping = aes(x = cscor, colour=sex))
• Example:
• #Use of the colour aesthetic
• ggplot(data = mydata) + geom_bar(mapping = aes(x = cscor, colour=sex), fill=NA)
• Example:
• ggplot(data = mydata) + geom_bar(mapping = aes(x = cscor, fill=sex), position="fill")
• Example:
• ggplot(data = mydata) + geom_bar(mapping = aes(x = cscor, fill=sex), position="dodge")
• Creating a ggplot : statistical transformation
• Example:
• # bars filled with other colors
• ggplot(data=mydata, aes(x=cscor, y=salnet)) + geom_bar(stat="identity", fill="#FF9999")
• Example:
• # bars filled with other colors
• ggplot(data=mydata, aes(x=cscor, y=salnet)) + geom_bar(stat="identity", fill="blue")
• Example :
• ggplot(data=mydata, aes(x=cscor, y=salnet)) + geom_point()+coord_flip()
The two axes are switched
• Creating a ggplot : Coordinate systems
• Syntax :
• coord_polar(theta, start, direction, clip)
• theta : variable to map angle to (x or y)
• start : offset of starting point from 12 o'clock in radians
• direction : 1, clockwise; -1, anticlockwise
• clip : should drawing be clipped to the extent of the plot panel? A
setting of "on" (the default) means yes, and a setting of "off"
means no.
• For instance if we run:
• coord_polar()
• The polar coordinate system is most commonly
used for pie charts, which are a stacked bar chart
in polar coordinates.
• For this purpose, we use factor()
• Example :
• ggplot(data=mydata, aes(x=factor(1), fill=factor(cscor))) + geom_bar(width = 1)+coord_polar(theta="y")+labs(x=NULL)
• coord_polar()
• The polar coordinate system can be used for a
bullseye chart.
• For this purpose, we also use factor()
• Example :
• ggplot(data=mydata, aes(x=factor(1), fill=factor(cscor))) + geom_bar(width = 1)+coord_polar()+labs(x=NULL)
Clustering with K-means
CLASS 4
• Clustering data with K-means
• https://www.sciencedirect.com/topics/computer-science/minkowski-distance
• Let n be the number of observations from the dataset.
• Let p be the number of characteristics (attributes): j=1
to p.
• $x_j$ is the vector of observations concerning characteristic « j »
• $x_j = (x_{1j}, \dots, x_{nj})^\top$
• $x_{ij}$ is the observation concerning characteristic « j » and individual « i »
• $X = (x_1, \dots, x_p)$ is the matrix (n,p) of the vectors $x_j$
• Minimization of $\sum_{k=1}^{K} \sum_{x \in C_k} (x - \mu_k)^\top (x - \mu_k)$
• We enter $X$ and K
• Let $\mu_k^{(0)}$, k=1 to K, be the initial values of the centroids
• t=1
• Do until the STOP CRITERION is satisfied :
• Assign each observation x to a cluster : $c^{(t)}(x) = \arg\min_{k} \, (x - \mu_k^{(t-1)})^\top (x - \mu_k^{(t-1)})$
• Let $C_k^{(t)}$ be the set of observations assigned to the cluster k : $\mu_k^{(t)} = \frac{1}{|C_k^{(t)}|} \sum_{x \in C_k^{(t)}} x$
• t=t+1
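The assignment and update steps above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the function name simple_kmeans, the stopping rule (stop when the assignments no longer change), the cap max_iter and the toy values are not from the lecture, and the sketch assumes that no cluster becomes empty.

```r
# Minimal K-means sketch using the squared Euclidean distance.
# Assumptions: X is numeric, init holds one initial centroid per row,
# no cluster becomes empty, and we stop when assignments are unchanged.
simple_kmeans <- function(X, init, max_iter = 100) {
  X <- as.matrix(X)
  mu <- as.matrix(init)
  cluster <- rep(0L, nrow(X))
  for (t in seq_len(max_iter)) {
    # Assignment step: squared distance of each observation to each centroid
    d2 <- sapply(seq_len(nrow(mu)), function(k) {
      rowSums((X - matrix(mu[k, ], nrow(X), ncol(X), byrow = TRUE))^2)
    })
    new_cluster <- max.col(-d2, ties.method = "first")
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    # Update step: each centroid becomes the mean of its cluster
    for (k in seq_len(nrow(mu))) {
      mu[k, ] <- colMeans(X[cluster == k, , drop = FALSE])
    }
  }
  list(cluster = cluster, centers = mu, iterations = t)
}

# Hypothetical toy data: five individuals, one characteristic
toy <- matrix(c(1, 2, 9, 12, 20), ncol = 1)
fit <- simple_kmeans(toy, init = matrix(c(1, 20), ncol = 1))
fit$cluster
fit$centers
```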
• Example 1
• Let us consider a dataset with 5 individuals and 1 characteristic

Individual   Characteristic 1
1            1
2            2
3            9
4            12
5            20
• Question:
• Apply K-means algorithm with
• K=2
• and $\mu_1^{(0)} = 1$, $\mu_2^{(0)} = 20$
• Answer:
• t=1, for example $(x - \mu_1^{(0)})^2 = (1-1)^2 = 0$

x                        1     2     9     12    20
$(x - \mu_1^{(0)})^2$    0     1     64    121   361
$(x - \mu_2^{(0)})^2$    361   324   121   64    0
Cluster                  1     1     1     2     2

• $C_1^{(1)} = \{1, 2, 9\}$, $C_2^{(1)} = \{12, 20\}$
• UPDATE
• $\mu_1^{(1)} = (1 + 2 + 9)/3 = 4$
• $\mu_2^{(1)} = (12 + 20)/2 = 16$
• t=2, for example $(x - \mu_1^{(1)})^2 = (1-4)^2 = 9$

x                        1     2     9     12    20
$(x - \mu_1^{(1)})^2$    9     4     25    64    256
$(x - \mu_2^{(1)})^2$    225   196   49    16    16
Cluster                  1     1     1     2     2

• $C_1^{(2)} = \{1, 2, 9\}$, $C_2^{(2)} = \{12, 20\}$
• The assignments no longer change. Then we stop.
• Final clusters : {1, 2, 9} with centroid 4, and {12, 20} with centroid 16
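The worked example can be checked with R's built-in kmeans(). A minimal sketch, assuming initial centroids 1 and 20 as in the distance tables of the answer. The option algorithm = "Lloyd" is chosen here because it follows the assign-then-update scheme of the slides; R's default algorithm is Hartigan-Wong, a different procedure that is not guaranteed to return the same partition.

```r
# The five observations of Example 1
x <- c(1, 2, 9, 12, 20)

# Lloyd's algorithm: assign each point to the nearest centroid, then update
fit <- kmeans(x, centers = matrix(c(1, 20), ncol = 1), algorithm = "Lloyd")

fit$cluster        # cluster of each observation
fit$centers        # final centroids
fit$tot.withinss   # total within-cluster sum of squares
```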
• REMARK : Normalization/Standardization
• As explained above, K-means relies on a distance function to determine the clusters. As a consequence, the units in which the variables are measured matter a great deal.
• For instance, suppose that there are two
variables : Age (in years) and height (in cm).
• Suppose that the Age variable ranges from 18 to
50; while the height variable ranges from 130 to
210.
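To see why the units matter, here is a small sketch with three hypothetical individuals placed at the ends of the two ranges above. The data frame people and the helper rescale01 are illustrative names, not part of the lecture.

```r
# Hypothetical individuals; Age in years, Height in cm
people <- data.frame(Age = c(18, 50, 18), Height = c(130, 130, 210))
rownames(people) <- c("A", "B", "C")

# Raw Euclidean distances: A and B differ by the full Age range (32),
# A and C differ by the full Height range (80)
d_raw <- as.matrix(dist(people))
d_raw["A", "B"]
d_raw["A", "C"]

# After rescaling each variable to [0, 1], both gaps weigh the same
rescale01 <- function(v) (v - min(v)) / (max(v) - min(v))
people01 <- as.data.frame(lapply(people, rescale01))
rownames(people01) <- rownames(people)
d_scaled <- as.matrix(dist(people01))
d_scaled["A", "B"]
d_scaled["A", "C"]
```

On the raw data the height difference dominates the distance only because height is measured on a wider numerical scale.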
• The terms normalization and standardization
are sometimes used interchangeably, but
they actually refer to different things.
• Normalization means to scale a variable to
have values between 0 and 1.
• Standardization transforms data to have a
mean of zero and a standard deviation of 1.
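• Some textbooks swap the two terms: what one source calls “normalization” another calls “standardization”, and vice versa. The slides and code that follow use “normalization” for the z-score and “standardization” for min-max scaling.
• What is important is that the two terms refer to different transformations.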
• Normalization with a z-score : the value $z_i$ concerning characteristic “z” for individual “i” is transformed by : $\frac{z_i - \bar{z}}{\sigma_z}$, where $\bar{z}$ and $\sigma_z$ are respectively the mean value and the standard deviation of z.
• Standardization using the min and max values : the value $z_i$ concerning characteristic “z” for individual “i” is transformed by : $\frac{z_i - z_{\min}}{z_{\max} - z_{\min}}$, where $z_{\min}$ and $z_{\max}$ are respectively the minimal and the maximal values of z.
• Normalization can be obtained using the scale() function.
• Suppose that mydata is the dataset that you want to use. Then we can normalize the (numeric) variables in this dataset using scale():
• mydata_2=scale(mydata)
• Or we can do it ourselves directly. Suppose that there are two variables (myVar1, myVar2) in mydata. Then we can create their normalized versions:
• mydata$zVar1 <- (mydata$myVar1 - mean(mydata$myVar1))/sd(mydata$myVar1)
• mydata$zVar2 <- (mydata$myVar2 - mean(mydata$myVar2))/sd(mydata$myVar2)
• Likewise, assuming that there are two variables (myVar1, myVar2) in mydata, we can create their standardized versions:
• mydata$sVar1 <- (mydata$myVar1 - min(mydata$myVar1))/(max(mydata$myVar1) - min(mydata$myVar1))
• mydata$sVar2 <- (mydata$myVar2 - min(mydata$myVar2))/(max(mydata$myVar2) - min(mydata$myVar2))
• We can also standardize (or normalize) the data using the data.Normalization function from the clusterSim package.
• install.packages("clusterSim")
• library(clusterSim)
• Syntax :
• data.Normalization(x, type="…", normalization="column")
• ARGUMENT :
• type:
• n3a : (x-median)/range
• n4 : (x-min)/range
• n5 : (x-mean)/max(abs(x-mean))
• n5a : (x-median)/max(abs(x-median))
• n6 : x/sd
• n6a : x/mad
• n7 : x/range
• n8 : x/max
• n9 : x/mean
• n9a : x/median
• n10 : x/sum
• n11 : x/sqrt(SSQ)
• n12 : (x-mean)/sqrt(sum((x-mean)^2))
• n12a : (x-median)/sqrt(sum((x-median)^2))
• n13 : (x-midrange)/(range/2)
• ARGUMENT :
• normalization: "column" - normalization by variable, "row" - normalization by object
• REMARK : Normalization/Standardization
• Therefore using the “n4” type will transform the
original variables in the dataset into variables
taking values in the interval [0,1].
• Syntax :
• mydata_3 <- data.Normalization(mydata, type="n4", normalization="column")
• Answer : NO
• However there exist some methods borrowed from
parametric statistics.
• For instance : where
• is the
measurement of the intra-inertia of the clustering of size
K
• is a kind of BIC (Bayesian Information Criterion)
• For K=3
•
• Remark :
• The initializing values are taken
from the dataset
• Remark:
• When using a “large” dataset, it is likely that
nstart will play a very minor role.
• Indeed, how many possibilities do we have to
choose k initial values for centroids over a set
of n individuals? For instance : n=12984, k=3.
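The count asked for in this remark can be computed directly. A minimal sketch, assuming that the initial centroids are k distinct individuals and that their order does not matter; the value nstart = 25 is purely illustrative.

```r
n <- 12984
k <- 3

# Number of unordered sets of k distinct individuals among n
n_sets <- choose(n, k)
n_sets

# Share of these sets explored by a hypothetical nstart = 25
25 / n_sets
```

Any realistic nstart therefore covers only a negligible share of the possible initializations.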
• Example :
• #We select two variables
• mydata_b <- mydata %>%
• select(c(salnet, age))
• If we want to work with normalized or standardized versions of the dataset, then we should first pre-process the data.
• #Normalization
• mydata_b1=scale(mydata_b)
• #Standardization
• mydata_b2=data.Normalization(mydata_b,type="n4",normalization="column")
• Example :
• #We can also save the result of kmeans into an object
• sangoku<-kmeans(mydata_b,3)
• Clustering vector (excerpt, observations 20 to 95; first row: observation, second row: cluster):
20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
 3  2  2  2  1  3  1  1  2  1  2  3  2  1  3  2  1  3  1
39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57
 2  3  2  3  3  1  1  2  2  1  2  3  2  1  2  3  3  1  2
58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76
 1  3  1  1  3  1  3  3  1  1  2  1  2  1  3  3  2  2  1
77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
 1  3  3  2  3  2  1  1  1  2  1  1  1  1  1  1  1  1  3
Available components:
• For instance:
• print(sangoku$centers)
• We get:
salnet age
1 32847.52 43.70687
2 80876.33 47.68246
3 17010.33 39.68400
• For instance:
• print(sangoku$withinss)
• We get:
• 150301605982 226433357446 162589612409
• ggplot(mydata_b)+geom_point(aes(salnet,age), col=sangoku$cluster)
• ggplot(mydata_b)+geom_point(aes(salnet,age), col=sangoku$cluster) + facet_wrap(~sangoku$cluster, nrow = 2)
• Calinski-Harabasz method
• Silhouette method
• (…)
• “tot.withinss”
• “betweenss/totss”
• Then we load :
• library(factoextra)
• library(fpc)
• library(NbClust)
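The two quantities “tot.withinss” and “betweenss/totss” mentioned above are components of the object returned by kmeans(), so they can be inspected for several values of K with base R alone. A minimal sketch on simulated data; the object names toy and criteria, the seed, the range of K and nstart = 15 are illustrative assumptions, not values from the lecture.

```r
set.seed(123)  # kmeans uses random starts, so fix the seed

# Hypothetical data: three well-separated groups in two dimensions
toy <- rbind(
  matrix(rnorm(60, mean = 0), ncol = 2),
  matrix(rnorm(60, mean = 5), ncol = 2),
  matrix(rnorm(60, mean = 10), ncol = 2)
)

# Fit K-means for K = 1 to 6 and collect the two criteria
ks <- 1:6
fits <- lapply(ks, function(k) kmeans(toy, centers = k, nstart = 15))
criteria <- data.frame(
  K = ks,
  tot.withinss = sapply(fits, function(f) f$tot.withinss),
  ratio = sapply(fits, function(f) f$betweenss / f$totss)
)
criteria

# "Elbow" plot: look for the K after which tot.withinss stops dropping sharply
plot(criteria$K, criteria$tot.withinss, type = "b",
     xlab = "K", ylab = "tot.withinss")
```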
• library(data.table)
• mydata_c <- data.table(mydata_b)
• mydata_c<-mydata_c[sample(.N, 200)]
• sangohan=kmeans(mydata_c,3)
• If we print sangohan
• print(sangohan)
• We get:
• Cluster means:
• salnet age
• 1 60665.20 46.20000
• 2 16890.02 38.84000
• 3 29800.07 43.28333
• Clustering vector:
• [1] 2 2 2 1 2 2 3 2 2 2 3 2 1 2 3 3 2 2 3 2 2 2 3 1 2 2 3 3 2 2 2 3 2 2 2 2 2 2 2 2
• [41] 3 3 2 3 2 2 2 2 2 2 2 2 2 2 3 2 2 2 3 3 2 1 3 3 3 3 2 2 3 3 2 2 2 2 2 2 3 3 2 2
• [81] 2 3 2 3 3 3 2 2 3 2 2 2 2 3 2 2 2 2 3 2 2 3 2 2 3 2 3 3 3 2 2 3 1 3 2 2 2 2 2 2
• [121] 2 2 2 2 2 1 1 3 1 2 3 3 3 2 2 3 2 2 2 3 1 3 1 2 3 2 2 2 2 1 2 2 2 2 2 2 2 3 3 3
• [161] 2 2 2 1 2 2 2 3 2 3 1 2 2 3 2 2 1 3 2 2 3 3 3 2 2 2 2 2 3 2 2 2 3 1 2 2 3 3 2 3
• Available components:
• Syntax:
• fviz_cluster(object, data = NULL, stand = TRUE, geom = c("point", "text"), frame = TRUE, frame.type = "convex")
• Syntax:
• fviz_nbclust(x, FUNcluster, method)
• Remark :
• In order to graphically illustrate how the observations move from one cluster to another, let us use the package “animation”.
• install.packages("animation")
• library("animation")
• kmeans.ani(mydata_c,3)
• Practice 15 :
1. Randomly select a subsample of 250 individuals from COI2006
2. Use the following variables : age, salnet, effl_corr, stress (for the latter variable, first recode it as a numerical variable)
3. First analyze the distribution of your dataset with regard to age, salnet, effl_corr, stress
4. Clustering with K=4 (with 15 initialization trials)
5. Merge your initial dataset with the cluster-number variable
6. Display the four clusters
7. What could be the optimal number of clusters for your dataset?
Project 2019
• Analyze the dataset for 2016, which includes the variables below, for some countries (take as many countries as you can).
• Your underlying question should be : is there a link between wealth and the so-called ESG criteria?
Variables
regulatory : REGULATORY
voice : VOICE
Combustible renewables and waste (% of total energy) : RENEENERGY