Data Visualization with R: F. Mhamdi


20-03-2023
Outline

1. Useful R libraries for data visualization
. ggplot2: programmatic plotting
. sjPlot: for researchers in the socio-economic field
. tabplot: large datasets
2. Visualizing multivariate data:
. Qualitative variables
. Quantitative variables
3. Mapping
ggplot2 (Reference: Hadley Wickham, 2005)
ggplot2 (Example 1)

Data frame:
dat <- data.frame(time = factor(c("Lunch", "Dinner"),
                                levels = c("Lunch", "Dinner")),
                  total_bill = c(14.89, 17.23))
dat

##     time total_bill
## 1  Lunch      14.89
## 2 Dinner      17.23
> library(ggplot2)
> # guides(fill = FALSE) is deprecated since ggplot2 3.3.4; use "none" instead
> ggplot(data = dat, aes(x = time, y = total_bill, fill = time)) +
    geom_bar(colour = "black", fill = "#DD8888", width = .8,
             stat = "identity") +
    guides(fill = "none") +
    xlab("Time of day") + ylab("Total bill") +
    ggtitle("Average bill for 2 people")

[Figure: bar chart titled "Average bill for 2 people"; x-axis "Time of day", y-axis "Total bill"]
ggplot2 (Example 2)
> library(reshape2)
> data(tips)
> head(tips)

##   total_bill  tip    sex smoker day   time size
## 1      16.99 1.01 Female     No Sun Dinner    2
## 2      10.34 1.66   Male     No Sun Dinner    3
## 3      21.01 3.50   Male     No Sun Dinner    3
## 4      23.68 3.31   Male     No Sun Dinner    2
## 5      24.59 3.61 Female     No Sun Dinner    4
## 6      25.29 4.71   Male     No Sun Dinner    4

> levels(tips$day)

## [1] "Fri"  "Sat"  "Sun"  "Thur"

> # Put the days in weekday order: Thur, Fri, Sat, Sun
> tips$day=factor(tips$day,levels=levels(tips$day)[c(4,1,2,3)])
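
Reordering levels by position with an index such as c(4,1,2,3) depends on the current order of the levels, so running the same line a second time gives a different order. A minimal base R sketch, using a small made-up vector rather than the tips data, compares this with naming the levels explicitly (all object names here are illustrative):

```r
# Toy vector (hypothetical, not the tips data)
day <- factor(c("Sun", "Thur", "Fri", "Sat", "Sun"))

wanted <- c("Thur", "Fri", "Sat", "Sun")

# Explicit level names: the result is the same however often it is run
by_name <- factor(day, levels = wanted)
by_name_again <- factor(by_name, levels = wanted)

# Positional reordering applied to an already ordered factor rotates it again
by_index_again <- factor(by_name, levels = levels(by_name)[c(4, 1, 2, 3)])

print(levels(by_name_again))
print(levels(by_index_again))
```

Naming the levels is slightly longer to type, but it keeps the order stable when a script is re-run line by line.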


> ggplot(data = tips, aes(x = day)) +
    geom_bar(stat = "count")

[Figure: bar chart of the number of observations ("count") per day; x-axis order Thur, Fri, Sat, Sun]


ggplot2 (Example 3)
> library(plyr)
> # Calculate the mean of tip for each day.
> # ddply: split a data frame, apply a function, and return the results in a data frame.
> mtips <- ddply(tips, "day", summarise, mtip = mean(tip))
> # tips$day is already in weekday order and mtips$day inherits it;
> # naming the levels keeps that order even if the line is run again
> mtips$day <- factor(mtips$day, levels = c("Thur", "Fri", "Sat", "Sun"))
> mtips

##    day     mtip
## 1 Thur 2.771452
## 2  Fri 2.734737
## 3  Sat 2.993103
## 4  Sun 3.255132
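
The split, apply, combine idea behind ddply can also be sketched in base R with aggregate(). The example below is only an illustration on a small hypothetical data frame (the values in `toy` are made up, not taken from tips):

```r
# Toy data (hypothetical values, not the tips data set)
toy <- data.frame(
  day = factor(c("Thur", "Thur", "Fri", "Fri", "Sat"),
               levels = c("Thur", "Fri", "Sat")),
  tip = c(2, 4, 1, 3, 5)
)

# Split by day, apply mean(), combine the results into a data frame
mtoy <- aggregate(tip ~ day, data = toy, FUN = mean)
names(mtoy)[2] <- "mtip"
print(mtoy)
```

The result has one row per day, in the order of the factor levels.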
> ggplot(data = mtips, aes(x = day, y = mtip)) +
    geom_bar(stat = "identity", fill = "red", alpha = .6) +
    theme_bw() + xlab("Day") +
    ylab("Average of tips")

[Figure: bar chart of the mean tip per day; x-axis "Day", y-axis "Average of tips"]
ggplot2 (Exercise 1)

For the same dataset, build a data frame containing:

1- the mean of the variable tip for each day
2- the standard deviation of the variable tip for each day
3- the lower bounds of the 0.95-level confidence intervals for the distribution of tip by day
4- the upper bounds of the 0.95-level confidence intervals for the distribution of tip by day
[Hint: for the confidence intervals, assume that the variable tip follows a normal distribution.]
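
Under a normal assumption, about 95% of the values lie within roughly 1.96 standard deviations of the mean, a multiplier that is often rounded to 2. A minimal base R sketch on made-up values (the vector `tip` below is hypothetical, not the tips data) shows how such bounds can be computed for one group:

```r
# Toy tips for a single group (hypothetical values)
tip <- c(2.0, 3.5, 1.5, 4.0, 3.0)

m <- mean(tip)
s <- sd(tip)
z <- qnorm(0.975)   # about 1.96; the value 2 is a common rounding

# Bounds of the central 0.95 range of a normal distribution N(m, s^2)
lower <- m - z * s
upper <- m + z * s
print(c(mean = m, sd = s, lower = lower, upper = upper))
```

Note that these bounds describe the spread of individual tips, not the uncertainty of the mean; an interval for the mean would divide the standard deviation by the square root of the group size.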
ggplot2 (Solution)

> library(plyr)
> mtips <- ddply(tips, "day", summarise, mtip = mean(tip), stip = sd(tip))
> # Name the levels explicitly: repeating a positional reordering would rotate the days again
> mtips$day <- factor(mtips$day, levels = c("Thur", "Fri", "Sat", "Sun"))
> # Approximate 0.95 bounds under normality: mean +/- 2 standard deviations
> mtips$lower <- mtips$mtip - 2 * mtips$stip
> mtips$upper <- mtips$mtip + 2 * mtips$stip
> mtips

##    day     mtip     stip      lower    upper
## 1 Thur 2.771452 1.240223  0.2910052 5.251898
## 2  Fri 2.734737 1.019577  0.6955827 4.773891
## 3  Sat 2.993103 1.631014 -0.2689252 6.255132
## 4  Sun 3.255132 1.234880  0.7853710 5.724892
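> # width is a fixed setting, so it goes outside aes()
> ggplot(mtips, aes(x = day, y = mtip, group = day)) +
    geom_errorbar(aes(ymin = lower, ymax = upper), width = .2) +
    geom_point(size = 3) + theme_bw() + xlab("Day") + ylab("Average of tips")

[Figure: mean tip per day shown as points with error bars from lower to upper; x-axis "Day", y-axis "Average of tips"]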
ggplot2 (Exercise 2)

For the same dataset:

1- build a data frame containing the means, the standard deviations, and the lower and upper bounds of the 0.95-level confidence intervals for the distribution of the variable "tip", for each combination of the qualitative variables day, sex, and smoker
2- what structure should a plot have in order to summarize all of this information?
[Hint: for the confidence intervals, assume that the variable tip follows a normal distribution.]
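
Summaries over a crossing of several qualitative variables can also be computed in base R with aggregate(). The sketch below uses a small hypothetical data frame and the rounded multiplier 2 as an approximation of the 0.95 normal bounds; it is one possible approach under those assumptions, not necessarily the one the exercise expects:

```r
# Toy data (hypothetical values, not the tips data set)
toy <- data.frame(
  sex    = c("F", "F", "M", "M", "F", "F", "M", "M"),
  smoker = c("No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes"),
  tip    = c(2, 4, 3, 5, 1, 3, 2, 6)
)

# One mean and one standard deviation per (sex, smoker) combination
m <- aggregate(tip ~ sex + smoker, data = toy, FUN = mean)
s <- aggregate(tip ~ sex + smoker, data = toy, FUN = sd)

# Join both summaries on the grouping columns and rename the value columns
res <- merge(m, s, by = c("sex", "smoker"), suffixes = c(".mean", ".sd"))
names(res)[3:4] <- c("mtip", "stip")

# Approximate 0.95 bounds under normality
res$lower <- res$mtip - 2 * res$stip
res$upper <- res$mtip + 2 * res$stip
print(res)
```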
ggplot2 (Solution)

> library(plyr)
> mtips <- ddply(tips, c("day", "sex", "smoker"), summarise,
                 mtip = mean(tip), stip = sd(tip))
> # Name the levels explicitly: repeating a positional reordering would rotate the days again
> mtips$day <- factor(mtips$day, levels = c("Thur", "Fri", "Sat", "Sun"))
> # Approximate 0.95 bounds under normality: mean +/- 2 standard deviations
> mtips$lower <- mtips$mtip - 2 * mtips$stip
> mtips$upper <- mtips$mtip + 2 * mtips$stip
> mtips

##     day    sex smoker     mtip      stip       lower upper
## 1  Thur Female     No 2.459600 1.0783687  0.30286265   4.6
## 2  Thur Female    Yes 2.990000 1.2040487  0.58190255   5.3
## 3  Thur   Male     No 2.941500 1.4856233 -0.02974659   5.9
## 4  Thur   Male    Yes 3.058000 1.1115735  0.83485308   5.2
## 5   Fri Female     No 3.125000 0.1767767  2.77144661   3.4
## 6   Fri Female    Yes 2.682857 1.0580125  0.56683212   4.7
## 7   Fri   Male     No 2.500000 1.4142136 -0.32842712   5.3
## 8   Fri   Male    Yes 2.741250 1.1668081  0.40763386   5.0
## 9   Sat Female     No 2.724615 0.9619045  0.80080640   4.6
## 10  Sat Female    Yes 2.868667 1.4613783 -0.05409002   5.7
## 11  Sat   Male     No 3.256562 1.8397486 -0.42293469   6.9
## 12  Sat   Male    Yes 2.879259 1.7443379 -0.60941660   6.3
## 13  Sun Female     No 3.329286 1.2823564  0.76457293   5.8
## 14  Sun Female    Yes 3.500000 0.4082483  2.68350342   4.3
## 15  Sun   Male     No 3.115349 1.2164005  0.68254779   5.5
> pd <- position_dodge(0.4)
> ggplot(mtips, aes(x = sex, y = mtip, col = smoker, group = smoker)) +
    geom_errorbar(aes(ymin = lower, ymax = upper), position = pd, width = .2) +
    geom_point(size = 3, position = pd) + theme_bw() + xlab("Gender") +
    ylab("Average of tips") + facet_wrap(~day)

[Figure: one panel per day (Thur, Fri, Sat, Sun); mean tip with error bars by gender, coloured by smoker (No/Yes); x-axis "Gender", y-axis "Average of tips"]
Alternatively:
> ggplot(tips, aes(x = sex, y = tip, col = smoker, fill = smoker)) +
    geom_boxplot(position = pd, width = .2, alpha = .5) + theme_bw() +
    xlab("Gender") + ylab("Tips") + facet_wrap(~day)

[Figure: one panel per day (Thur, Fri, Sat, Sun); boxplots of tip by gender, coloured by smoker (No/Yes); x-axis "Gender", y-axis "Tips"]
ggplot2 (Exercise 3)

Question: tips in terms of Smoker x Gender x Time (Lunch/Dinner)
ggplot2 (Solution)

> # The x-axis shows the day and the y-axis the tip, so the axes are labelled accordingly
> ggplot(tips, aes(x = day, y = tip, col = time, fill = time)) +
    geom_boxplot(alpha = .4) + theme_bw() + xlab("Day") + ylab("Tips") +
    facet_grid(sex ~ smoker) + ggtitle("Tips in terms of Smoker x Gender")

[Figure: grid of panels with sex in rows (Female, Male) and smoker in columns (No, Yes); boxplots of tip by day (Thur, Fri, Sat, Sun), coloured by time (Dinner/Lunch)]
sjPlot

▶ Author: Daniel Lüdecke
▶ Website: http://www.strengejacke.de/sjPlot/
▶ A data visualization package for statistics in the social sciences
▶ It contains functions for importing data from various formats: SPSS, STATA, SAS, etc.
▶ It supports labelling and manipulating factor variables in the data.
sjPlot (Defining a theme)
> library(sjPlot)
> library(sjmisc)
> library(ggplot2)
> set_theme(geom.outline.color = "antiquewhite4",
            geom.outline.size = 1,
            geom.label.size = 2, geom.label.color = "black",
            title.color = "red", title.size = 1.5, axis.textcolor = "blue",
            base = theme_bw())



sjPlot: Bar chart

Example 1
library(sjPlot)
data(efc)
class(efc)
plot_frq(efc$tot_sc_e)

[Figure: bar chart of efc$tot_sc_e ("Services for elderly"), values 0 to 9, with the count and percentage printed above each bar]
Example 2:
attr(efc$e42dep, "labels")
plot_frq(efc$e42dep)

[Figure: bar chart of "elder's dependency": independent 66 (7.3%), slightly dependent 225 (25.0%), moderately dependent 306 (34.0%), severely dependent 304 (33.7%)]
Example 3
plot_frq(efc$e42dep, coord.flip = TRUE, geom.size = .4)

[Figure: the same bar chart of "elder's dependency" with flipped coordinates (horizontal, narrower bars), counts and percentages next to each bar]
# plot_frq() replaces the older sjp.frq() in recent sjPlot versions
plot_frq(efc$e42dep, show.prc = TRUE, show.n = FALSE)

[Figure: bar chart of "elder's dependency" labelled with percentages only: independent 7.3%, slightly dependent 25.0%, moderately dependent 34.0%, severely dependent 33.7%]
sjPlot: Contingency table

Example 1
xtabs(~efc$e16sex+efc$e42dep)

##           efc$e42dep
## efc$e16sex   1   2   3   4
##          1  23  70 109  93
##          2  43 154 197 211
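
The percentages shown by a contingency-table plot are row or column proportions of the count table. A minimal base R sketch with prop.table(), re-entering by hand the counts printed by xtabs() above (the object names `tab`, `within_sex`, and `within_dep` are illustrative):

```r
# Counts copied from the xtabs() output above
# (rows: efc$e16sex codes 1 and 2; columns: efc$e42dep codes 1 to 4)
tab <- matrix(c(23, 70, 109, 93,
                43, 154, 197, 211),
              nrow = 2, byrow = TRUE,
              dimnames = list(e16sex = c("1", "2"),
                              e42dep = c("1", "2", "3", "4")))

# Share of each dependency level within each sex (each row sums to 100)
within_sex <- round(100 * prop.table(tab, margin = 1), 1)

# Share of each sex within each dependency level (each column sums to 100)
within_dep <- round(100 * prop.table(tab, margin = 2), 1)

print(within_sex)
print(within_dep)
```

Which of the two tables is appropriate depends on the question asked: the first compares dependency profiles between the sexes, the second compares the sex composition of each dependency level.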
# plot_xtab() replaces the older sjp.xtab() in recent sjPlot versions
plot_xtab(x = efc$e42dep, grp = efc$e16sex)

[Figure: grouped bar chart of "elder's dependency" (independent, slightly, moderately, severely dependent) by "elder's gender" (male, female) plus a Total bar; percentages on a 0% to 40% axis]
# plot_xtab() replaces the older sjp.xtab() in recent sjPlot versions
plot_xtab(x = efc$e42dep, grp = efc$e16sex, show.n = FALSE,
          show.total = FALSE, type = "bar")

[Figure: the same grouped bar chart without the Total bar and without counts; legend "elder's gender" (male, female)]
Other representations:
▶ change to type = "line"
▶ add the option bar.pos = "stack"
▶ add both options: bar.pos = "stack", margin = "row"

[Figure: stacked bar chart of "elder's dependency" by "elder's gender" (male, female) on a 0% to 100% axis]
plot_xtab(x = efc$e42dep, grp = efc$e16sex, show.n = FALSE,
          show.total = FALSE,
          type = "bar", bar.pos = "stack", margin = "row")

[Figure: stacked bar chart with row percentages; for each dependency level (independent, slightly, moderately, severely dependent) the male and female shares sum to 100%, with males between about 31% and 36%]
sjPlot: Stacked bar plot

Plots several variables that share the same categories.

> # retrieve the position of the first item of the COPE-index scale
> start <- which(colnames(efc) == "c82cop1")
> # retrieve the position of the last item of the COPE-index scale
> end <- which(colnames(efc) == "c90cop9")
> # plot_stackfrq() replaces the older sjp.stackfrq() in recent sjPlot versions
> plot_stackfrq(efc[, start:end], expand.grid = TRUE, geom.size = .4,
                sort.frq = "last.desc")
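
Selecting a block of adjacent columns by the names of its first and last column, as done above, can be tried on a small example. The data frame `toy` and its column names below are hypothetical:

```r
# Toy data frame with hypothetical column names
toy <- data.frame(id = 1:3,
                  q1 = c(1, 2, 3),
                  q2 = c(2, 2, 1),
                  q3 = c(4, 1, 1),
                  other = c("a", "b", "c"))

start <- which(colnames(toy) == "q1")   # position of the first item
end   <- which(colnames(toy) == "q3")   # position of the last item

# Keep every column from the first item to the last one
items <- toy[, start:end]
print(colnames(items))
```

This only works when the items are stored next to each other in the data frame, which is why both positions are looked up by name first.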
[Figure: horizontal stacked bar chart of the nine COPE-index items on a 0% to 100% axis; legend: never, sometimes, often, always. Percentages read from the bars, in legend order:]

never  sometimes  often  always  item
 8.6%      23.6%  33.8%   34.0%  do you feel caregiving worthwhile? (n=888)
 0.3%      10.8%  65.6%   23.3%  do you feel you cope well as caregiver? (n=901)
34.7%      26.3%  26.8%   12.2%  do you feel supported by friends/neighbours? (n=901)
37.3%      41.6%  12.6%    8.6%  do you feel trapped in your role as caregiver? (n=900)
45.6%      38.5%   9.5%    6.5%  does caregiving have negative effect on your physical health? (n=898)
57.2%      27.9%   9.1%    5.8%  does caregiving cause difficulties in your relationship with your friends? (n=902)
20.6%      60.6%  14.4%    4.3%  do you find caregiving too demanding? (n=902)
79.2%      14.6%   4.3%    1.9%  does caregiving cause financial difficulties? (n=900)
69.4%      23.4%   5.5%    1.7%  does caregiving cause difficulties in your relationship with your family? (n=902)


sjPlot: Exercise

1- Simulate a dataset (a data frame of 500 observations):
▶ Five items (columns)
▶ Each item has 4 category values: two "positive" values (agree and strongly agree) versus two negative values (disagree and strongly disagree).
2- Change the category probabilities of each item.
3- Choose the appropriate representation for this dataset.
sjPlot: Solution

df <- data.frame(
question1 = as.factor(sample(1:4, 500, replace = TRUE)),
question2 = as.factor(sample(1:4, 500, replace = TRUE)),
question3 = as.factor(sample(1:4, 500, replace = TRUE)),
question4 = as.factor(sample(1:4, 500, replace = TRUE)),
question5 = as.factor(sample(1:4, 500, replace = TRUE))
)
head(df)

##   question1 question2 question3 question4 question5
## 1         2         3         1         1         4
## 2         1         4         3         1         3
## 3         3         1         4         1         3
## 4         2         4         2         3         4
## 5         3         1         3         2         2
## 6         1         3         1         1         4
Creating the labels and the items

> labels <- c("Strongly agree", "Agree", "Disagree", "Strongly disagree")
> items <- c("Question 1", "Question 2", "Question 3",
             "Question 4", "Question 5")
> plot_likert(df, axis.labels = items, legend.labels = labels,
              geom.size = 0.4)
[Figure: Likert plot of Question 1 to Question 5 (n=500 each), axis from 100% to 0% to 100%; legend: Strongly agree, Agree, Disagree, Strongly disagree; every category lies between about 20% and 28%]
# Step 2: change the category probabilities of each item
df <- data.frame(
  question1 = as.factor(sample(1:4, 500, replace = TRUE,
                               prob = c(0.18, 0.5, 0.2, 0.12))),
  question2 = as.factor(sample(1:4, 500, replace = TRUE,
                               prob = c(0.32, 0.18, 0.28, 0.22))),
  # these weights sum to 1.1; sample() rescales them so that they sum to one
  question3 = as.factor(sample(1:4, 500, replace = TRUE,
                               prob = c(0.6, 0.3, 0.1, 0.1))),
  question4 = as.factor(sample(1:4, 500, replace = TRUE,
                               prob = c(0.4, 0.4, 0.15, 0.05))),
  question5 = as.factor(sample(1:4, 500, replace = TRUE,
                               prob = c(0.05, 0.05, 0.35, 0.55)))
)
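
The prob argument of sample() is a vector of weights that need not sum to one. A minimal sketch, using the weights given for question3 and an arbitrary seed chosen here only for reproducibility, compares the rescaled probabilities with the simulated proportions:

```r
set.seed(1)  # arbitrary seed (an assumption, not in the slides)

w <- c(0.6, 0.3, 0.1, 0.1)   # weights as written for question3; they sum to 1.1
p <- w / sum(w)              # probabilities after rescaling to sum to one

# Simulate one item and tabulate the observed category proportions
x <- factor(sample(1:4, 500, replace = TRUE, prob = w), levels = 1:4)
observed <- prop.table(table(x))

print(round(p, 3))
print(round(observed, 3))
```

With 500 draws the observed proportions should be close to, but not exactly equal to, the rescaled probabilities.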
[Figure: Likert plot of Question 1 to Question 5 (n=500 each) after changing the category probabilities, axis from 100% to 0% to 100%; legend: Strongly agree, Agree, Disagree, Strongly disagree; the bars are now clearly unbalanced between items]
Corrplot: visualizing the correlation matrix

> library(corrplot)
> data(mtcars)
> head(mtcars)
> M <- cor(mtcars)
> corrplot(M, method = "circle")
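
Before plotting, the correlation matrix itself can be inspected in base R. The short sketch below uses only cor() and the built-in mtcars data; the helper names `r_mpg` and `strongest` are illustrative:

```r
data(mtcars)
M <- cor(mtcars)   # Pearson correlations between all pairs of columns

# Correlations of mpg with every other variable
r_mpg <- M["mpg", colnames(M) != "mpg"]

# Variable with the largest absolute correlation with mpg
strongest <- names(which.max(abs(r_mpg)))

print(round(M["mpg", "wt"], 2))
print(strongest)
```

The matrix is symmetric with ones on the diagonal, which is why corrplot can display it as a square grid.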
## corrplot 0.92 loaded

##                    mpg cyl disp  hp drat    wt  qsec vs
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1
[Figure: correlation matrix of mtcars displayed as numbers, colour scale from -1 to 1. Values read from the plot:]

       mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
mpg   1.00 -0.85 -0.85 -0.78  0.68 -0.87  0.42  0.66  0.60  0.48 -0.55
cyl  -0.85  1.00  0.90  0.83 -0.70  0.78 -0.59 -0.81 -0.52 -0.49  0.53
disp -0.85  0.90  1.00  0.79 -0.71  0.89 -0.43 -0.71 -0.59 -0.56  0.39
hp   -0.78  0.83  0.79  1.00 -0.45  0.66 -0.71 -0.72 -0.24 -0.13  0.75
drat  0.68 -0.70 -0.71 -0.45  1.00 -0.71  0.09  0.44  0.71  0.70 -0.09
wt   -0.87  0.78  0.89  0.66 -0.71  1.00 -0.17 -0.55 -0.69 -0.58  0.43
qsec  0.42 -0.59 -0.43 -0.71  0.09 -0.17  1.00  0.74 -0.23 -0.21 -0.66
vs    0.66 -0.81 -0.71 -0.72  0.44 -0.55  0.74  1.00  0.17  0.21 -0.57
am    0.60 -0.52 -0.59 -0.24  0.71 -0.69 -0.23  0.17  1.00  0.79  0.06
gear  0.48 -0.49 -0.56 -0.13  0.70 -0.58 -0.21  0.21  0.79  1.00  0.27
carb -0.55  0.53  0.39  0.75 -0.09  0.43 -0.66 -0.57  0.06  0.27  1.00
[Figure: correlation matrix of mtcars displayed as circles (method = "circle"); rows and columns mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb; colour scale from -1 to 1]
Mapping (see lab session TP5)
