Intro R
> install.packages("ISwR")
To uninstall a package:
> remove.packages("ISwR")
Some packages are already loaded when R starts up. At any
point, the currently loaded packages can be listed with the
search function:
> search()
[1] ".GlobalEnv" "package:lattice"
[3] "package:tools" "package:methods"
[5] "package:stats" "package:graphics"
[7] "package:grDevices" "package:utils"
[9] "package:datasets" "Autoloads"
[11] "package:base"
Other packages can be loaded by the user. We will be interested
in the ISwR package, which contains the datasets used in the
text. This can be loaded by:
> library(ISwR)
> library(help = "ISwR") # List the functions provided in "ISwR"
Most R packages provide some data sets as well as functions. Use function data() to see the
data sets that are loaded by default. Data sets have help pages. For example the page describing
the structure and variables of the data set named swiss is displayed by help(swiss). To get
information on the data sets that are included in the datasets
package, specify:
data(package="datasets") # Specify 'package', not 'library'.
Replace "datasets" by the name of any other installed package.
4- Basic commands
R : starts R interactively.
q() : quits R.
help(solve) or ?solve : shows the help page for solve.
help.start() : opens the HTML help in a browser.
help.search('chi') : searches the help for the keyword fragment 'chi'.
example(solve) : runs the examples from the documentation of solve.
demo(package = "stats") : lists the demos of the stats package.
demo(nlm, package = "stats") : runs the nlm demo from the stats package.
2) Arithmetic
R uses the usual symbols for addition +, subtraction -, multiplication *, division
/, and exponentiation ^. R calculates to a high precision, but by default numbers are
displayed with 7 significant digits. You can change the display to x digits
using options(digits = x).
R has a number of built-in functions, for example sin(x), cos(x), tan(x),
(all in radians), exp(x), log(x), and sqrt(x). Some special constants such
as pi are also predefined.
> exp(1)
[1] 2.718282
> options(digits = 16)
> exp(1)
[1] 2.718281828459045
> pi
[1] 3.141592653589793
> round(pi,digits=3)
[1] 3.142
3) Vectors
Data vectors can be made with the c() function, which combines its arguments. The whale data
can be entered as follows:
> whales = c(74, 122, 235, 111, 292, 111, 211, 133, 156, 79)
The values are separated by a comma. Once stored, the values can be printed by typing the
variable name
> whales
[1] 74 122 235 111 292 111 211 133 156 79
The [1] refers to the first observation. If more than one row is output, then this number
refers to the first observation in that row.
The c() function can also combine data vectors. For example:
> x = c(74, 122, 235, 111, 292)
> y = c(111, 211, 133, 156, 79)
> c(x,y)
[1] 74 122 235 111 292 111 211 133 156 79
Data vectors have a type. One restriction on data vectors is that all the values have the
same type. This can be numeric, as in whales, character strings, as in
> simpsons = c("Homer",'Marge',"Bart","Lisa","Maggie")
or one of the other types we will encounter. Character strings are made with matching
quotes, either double, ", or single, '. If we mix types within a data vector, the data will be
coerced into a common type, which is usually character.
Data can be of different kinds, which R groups into categories called modes:
– numeric (numeric values, whose type can be integer or double)
– logical (Boolean values, TRUE/FALSE)
– complex (complex numbers)
– character (character strings)
The mode of a variable is given by the command mode() and its type by
typeof().
The following functions test the type of a value or variable and
convert it to another type:
Test            Conversion
is.numeric()    as.numeric()
is.complex()    as.complex()
is.character()  as.character()
is.logical()    as.logical()
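As a brief illustration of the test and conversion functions above, the following sketch checks the mode and type of a few vectors and shows what coercion does when types are mixed. The variable names and values are made up for the example.

```r
v_int <- 1:3                 # integer vector
v_dbl <- c(1.5, 2, 3)        # double vector
mode(v_int)                  # "numeric"
typeof(v_int)                # "integer"
typeof(v_dbl)                # "double"

mixed <- c(1, "a", TRUE)     # mixing types coerces every value to character
mode(mixed)                  # "character"

as.numeric("3.14")           # 3.14
as.logical(c("TRUE", "no"))  # TRUE NA ("no" is not a recognised logical string)
is.character(mixed)          # TRUE
```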
Character vectors
Data, reports and figures require frequent manipulation of characters. Character strings are
delineated by double or single quotes. Here is an example:
> (s <- c("Florida; a politician's","nightmare"))
[1] "Florida; a politician's" "nightmare"
The vector s has two elements. To create a single string from s[1] and s[2],
we paste() them:
> paste(s[1], s[2])
[1] "Florida; a politician's nightmare"
By default, paste() separates its arguments with a space. If you want a different
separator between the strings, use the argument sep:
> paste(s[1], s[2], sep = '-')
[1] "Florida; a politician's-nightmare"
Giving data vectors named entries. A data vector can have its entries named. These
will show up when it is printed. The names() function is used to retrieve and set values
for the names. This is done as follows:
> names(simpsons) = c("dad","mom","son","daughter 1","daughter 2")
> names(simpsons)
[1] "dad" "mom" "son" "daughter 1"
[5] "daughter 2"
> simpsons
dad mom son daughter 1 daughter 2
"Homer" "Marge" "Bart" "Lisa" "Maggie"
To remove the names:
> names(simpsons) <- NULL
Accessing by names. In R, when the data vector has names, the values can be
accessed by their names. This is done by using the names in place of the indices. A
simple example follows:
> x = 1:3
> names(x) = c("one","two","three") # set the names
> x["one"]
one
1
Using data.entry() to edit data: data.entry(x) will allow us to edit the data vector x. The
function does not make a new variable. To use data.entry() to make a new variable, we can first
create a simple one, as we have done below, and then finish the data entry with the spreadsheet.
> x <-c(1)
> data.entry(x)
It is also possible, if one wants to enter some data on the keyboard, to use
the function scan with simply the default options:
> x <- scan()
1: 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0
10:
Read 9 items
> x
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0
Simple sequences
> (v <- seq(0, 1, by = 0.1))
[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
> (v <- seq(0, 1, length.out = 11)) # 11 equally spaced values between 0 and 1
[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
> seq(10)
[1] 1 2 3 4 5 6 7 8 9 10
> v <- 1:3 # the elements of v are 1, 2, 3
Repeated numbers. When a vector of repeated values is desired, the rep() function is
used.
> rep(1, 10)
[1] 1 1 1 1 1 1 1 1 1 1
> rep(1:3,3) # or rep(1:3,times=3)
[1] 1 2 3 1 2 3 1 2 3
> rep(1:3,each=3)
[1] 1 1 1 2 2 2 3 3 3
> rep(c("long","short"),c(1,2)) # 1 long and 2 short
[1] "long" "short" "short"
> sum(v) # sum of the elements of v
> sd(v) # standard deviation of v
> sum(v[1:3]) # sum of the first 3 elements of v
> v[c(1,4)] # the elements at positions 1 and 4
> v[-c(1,4)] # all elements except those at positions 1 and 4
> x[1] <- 5 # changes the value of the element at position 1
To replace every value equal to 1 by the value 25, use the following
command:
> x[x == 1] <- 25
> x[c(1, 4)] <- c(20, 30) # the elements at positions 1 and 4 become 20 and 30 respectively
> v[v > 2] # the elements whose values are greater than 2
If you have two vectors with the same number of elements, you
can extract the values of one for which the values of the other
are greater (or smaller) than a given value. For example, suppose the vectors x
and y each contain 5 values. The commands below extract the values
of y for which x is greater than 4, then the values of x for which y is greater than 12:
> x <- 1:5
> y <- 10:14
> y[x > 4]
[1] 14
> x[y > 12]
[1] 4 5
> max(v) # largest value in v
> which(v == max(v)) # index in v of the maximum
> length(v) # number of elements of v
> cumsum(a) # cumulative sum of a
> cumsum(c(1,3,5))
[1] 1 4 9
> cumprod(b) # cumulative product of b
> cumprod(c(1,3,5))
[1] 1 3 15
Exercise
If you want to create a sequence of the same length as an existing vector, use along
like this:
> x<-10:20
> seq(along = x)
[1] 1 2 3 4 5 6 7 8 9 10 11
> seq(88,50,along=x)
[1] 88.0 84.2 80.4 76.6 72.8 69.0 65.2 61.4 57.6 53.8 50.0
Creating a Vector
Named Elements within Vectors
Working with Vectors and Logical Subscripts
Take the example of a vector containing the 11 numbers 0 to 10:
x<-0:10
There are two quite different kinds of things we might want to do with this. We might want
to add up the values of the elements:
sum(x)
[1] 55
Alternatively, we might want to count the elements that passed some logical criterion.
Suppose we wanted to know how many of the values were less than 5:
sum(x<5)
[1] 5
This replaces
> length(x[x<5])
[1] 5
You see the distinction. We use the vector function sum in both cases. But sum(x) adds
up the values of the xs and sum(x<5) counts up the number of cases that pass the logical
condition ‘x is less than 5’. Logical TRUE has been coerced to numeric 1 and logical FALSE has
been coerced to numeric 0.
To find the sum of the values of x that are less than 5, we write:
sum(x[x<5])
[1] 10
Let’s look at this in more detail. The logical condition x<5 is either true or false:
x<5
[1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
[10] FALSE FALSE
You can imagine false as being numeric 0 and true as being numeric 1. Then the vector of
subscripts [x<5] is five 1s followed by six 0s:
1*(x<5)
[1] 1 1 1 1 1 0 0 0 0 0 0
Now imagine multiplying the values of x by the values of the logical vector
x*(x<5)
[1] 0 1 2 3 4 0 0 0 0 0 0
When the function sum is applied, it gives us the answer we want: the sum of the values
of the numbers 0+1+2+3+4=10.
sum(x*(x<5))
[1] 10
Exercise :
Suppose we want to work out the sum of the three largest values in a vector. There are
two steps: first sort the vector into descending order. Then add up the values of the first
three elements of the sorted vector. Let’s do this in stages. First, the values of y:
y<-c(8,3,5,7,6,6,8,9,2,3,9,4,10,4,11)
Now if you apply sort to this, the numbers will be in ascending sequence, and this makes
life slightly harder for the present problem:
sort(y)
[1] 2 3 3 4 4 5 6 6 7 8 8 9 9 10 11
We can use the reverse function, rev like this (use the Up arrow key to save typing):
rev(sort(y))
[1] 11 10 9 9 8 8 7 6 6 5 4 4 3 3 2
So the answer to our problem is 11+10+9=30. But how to compute this? We can use
specific subscripts to discover the contents of any element of a vector. We can see that 10
is the second element of the sorted array. To compute this we just specify the subscript [2]:
rev(sort(y))[2]
[1] 10
A range of subscripts is simply a series generated using the colon operator. We want the
subscripts 1 to 3, so this is:
rev(sort(y))[1:3]
[1] 11 10 9
So the answer to the exercise is just
sum(rev(sort(y))[1:3])
[1] 30
This is equivalent to
sum(sort(y)[(length(y)-2):length(y)])
Exercise :
To extract every nth element from a long vector we can use seq as an index. In this case
I want every 25th value in a 1000-long vector of normal random numbers with mean value
100 and standard deviation 10:
xv<-rnorm(1000,100,10)
xv[seq(25,length(xv),25)]
[1] 100.98176 91.69614 116.69185 97.89538 108.48568 100.32891 94.46233
[8] 118.05943 92.41213 100.01887 112.41775 106.14260 93.79951 105.74173
[15] 102.84938 88.56408 114.52787 87.64789 112.71475 106.89868 109.80862
[22] 93.20438 96.31240 85.96460 105.77331 97.54514 92.01761 97.78516
[29] 87.90883 96.72253 94.86647 90.87149 80.01337 97.98327 92.77398
[36] 121.47810 92.40182 87.65205 115.80945 87.60231
It is often useful to have the values in a vector labelled in some way. For instance, if our
data are counts of 0, 1, 2, … occurrences in a vector called counts (or frequencies)
> (counts<-c(25,12,7,4,6,2,1,0,2))
[1] 25 12 7 4 6 2 1 0 2
so that there were 25 zeros, 12 ones and so on, it would be useful to name each of the
counts with the relevant number 0 to 8:
> names(counts)<-0:8
Now when we inspect the vector called counts we see both the names and the frequencies:
counts
 0  1  2  3  4  5  6  7  8
25 12  7  4  6  2  1  0  2
If you have computed a table of counts, and you want to remove the names, then use the
as.vector function like this:
> (st<-table(rpois(2000,2.3))) # Poisson counts: $P[X = k] = e^{-\lambda} \lambda^{k} / k!$ with $\lambda = 2.3$
  0   1   2   3   4   5   6   7   8   9
205 455 510 431 233 102  43  13   7   1
> names(st)
[1] "0" "1" "2" "3" "4" "5" "6" "7" "8" "9"
> as.vector(st)
[1] 205 455 510 431 233 102 43 13 7 1
Exercise
Suppose our task is to calculate a trimmed mean of x which ignores both the smallest
and largest values (i.e. we want to leave out the 1 and the 8 in the example x <- c(5,8,6,7,1,5,3)). There are two
steps to this. First, we sort the vector x. Then we remove the first element using x[-1] and
the last using x[-length(x)]. We can do both drops at the same time by concatenating both
instructions like this: -c(1,length(x)). Then we use the built-in function mean:
trim.mean <- function (x) mean(sort(x)[-c(1,length(x))])
Now try it out. The answer should be mean(c(5,6,7,5,3)) = 26/5 = 5.2:
trim.mean(x)
[1] 5.2
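Examples
Each modification below starts again from the original vector x:
> x <- c(5,8,6,7,1,5,3)
> x
[1] 5 8 6 7 1 5 3
> x[x!=5]<-0
> x
[1] 5 0 0 0 0 5 0
> 10%%2 # remainder of the integer division
[1] 0
> x <- c(5,8,6,7,1,5,3)
> x[x%%2==0]
[1] 8 6
> 31%/%4 # integer division
[1] 7
> x[!x<5]<-0
> x
[1] 0 0 0 0 1 0 3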
4) Functions
Custom functions to extend the R language can be created using the function
keyword. For example, a function to calculate the standard deviation of a sample
vector x could be defined as follows:
> se<-function(x){
n<-length(x)
xbar<-mean(x)
sqrt((sum((x-xbar)^2))/(n-1))}
The function arguments are declared as the arguments to the function keyword.
Here there is just one argument, named x. The value returned by
the function is the value of its final line. The name of the function is the name of
the variable you assign it to. Here the function is named se. This function could
then be used as follows:
> y<-rnorm(100)
> se(y)
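One way to check a hand-written function like this is to compare it with the built-in equivalent. The sketch below repeats the definition of se so that it runs on its own, and compares it with sd(); the seed value is arbitrary and only makes the run repeatable.

```r
# Same definition as above, repeated so the example is self-contained
se <- function(x) {
  n <- length(x)
  xbar <- mean(x)
  sqrt(sum((x - xbar)^2) / (n - 1))
}

set.seed(1)               # arbitrary seed, chosen only for illustration
y <- rnorm(100)
all.equal(se(y), sd(y))   # TRUE: se() agrees with the built-in sd()
```

all.equal() is used rather than == because two floating-point computations of the same quantity may differ in the last digits.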
Mathematical functions in R
a<-0; ifelse (a>=1, b<-"oui", b<-"non"); b
log(x)                        log to base e of x
exp(x)                        antilog of x, e^x
log(x,n)                      log to base n of x
log10(x)                      log to base 10 of x
sqrt(x)                       square root of x
factorial(x)                  x!
choose(n,x)                   binomial coefficients n!/(x! (n−x)!)
floor(x)                      greatest integer <= x
ceiling(x)                    smallest integer >= x
trunc(x)                      closest integer to x between x and 0: trunc(1.5) = 1, trunc(-1.5) = −1;
                              trunc is like floor for positive values and like ceiling for negative values
round(x, digits=0)            round the value of x to an integer
runif(n)                      generates n random numbers between 0 and 1 from a uniform distribution
cos(x)                        cosine of x in radians
sin(x)                        sine of x in radians
tan(x)                        tangent of x in radians
acos(x), asin(x), atan(x)     inverse trigonometric transformations of real or complex numbers
acosh(x), asinh(x), atanh(x)  inverse hyperbolic trigonometric transformations of real or complex numbers
abs(x)                        the absolute value of x
unique(): as its name suggests, removes the duplicates from a vector.
> unique(c(1,3,6,2,7,4,8,1,0))
[1] 1 3 6 2 7 4 8 0
Infinity and Things that Are Not a Number (NaN)
Calculations can lead to answers that are plus infinity, represented in R by Inf, or minus
infinity, which is represented as -Inf:
> 3/0
[1] Inf
> -12/0
[1] -Inf
Calculations involving infinity can be evaluated: for instance,
> exp(-Inf)
[1] 0
> 0/Inf
[1] 0
Other calculations, however, lead to quantities that are not numbers. These are represented
in R by NaN (‘not a number’). Here are some of the classic cases:
> 0/0
[1] NaN
> Inf-Inf
[1] NaN
> Inf/Inf
[1] NaN
> is.finite(10)
[1] TRUE
> is.infinite(10)
[1] FALSE
> is.infinite(Inf)
[1] TRUE
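The test functions can be applied to a whole vector at once. The sketch below, using a made-up vector z, contrasts is.finite(), is.infinite() and is.nan() with is.na(); NA is the missing-value marker discussed in the next section.

```r
z <- c(1, Inf, -Inf, NaN, NA)   # illustrative values only
is.finite(z)     # TRUE  FALSE FALSE FALSE FALSE
is.infinite(z)   # FALSE TRUE  TRUE  FALSE FALSE
is.nan(z)        # FALSE FALSE FALSE TRUE  FALSE
is.na(z)         # FALSE FALSE FALSE TRUE  TRUE
```

Note that is.na() is TRUE for both NaN and NA, whereas is.nan() is TRUE only for NaN.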
5) Missing values NA
Missing values in dataframes are a real source of irritation because they affect the way that
model-fitting functions operate and they can greatly reduce the power of the modelling that
we would like to do.
Some functions do not work with their default settings when there are missing values in
the data, and mean is a classic example of this:
> x<-c(1:8,NA)
> mean(x)
[1] NA
In order to calculate the mean of the non-missing values, you need to specify that the
NA are to be removed, using the na.rm=TRUE argument:
> mean(x,na.rm=T)
[1] 4.5
To check for the location of missing values within a vector, use the function is.na(x)
rather than x !="NA". Here is an example where we want to find the locations (7 and 8) of
missing values within a vector called vmv:
> vmv
[1] 1 2 3 4 5 6 NA NA 9 10 11 12
> is.na(vmv)
[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE
[12] FALSE
Making an index of the missing values is achieved using which like this:
> which(is.na(vmv))
[1] 7 8
or
> which(is.na(vmv)==T)
[1] 7 8
This returns the indices (positions) of the elements of vmv for which the condition
is.na(vmv)==T holds.
If the missing values are genuine counts of zero, you might want to edit the NA to 0.
Use the is.na function to generate subscripts for this
> vmv[is.na(vmv)]<- 0
> vmv
[1] 1 2 3 4 5 6 0 0 9 10 11 12
The indexing works in the same way as for vmv[vmv<4].
or use the ifelse function like this
vmv<-c(1:6,NA,NA,9:12)
ifelse(is.na(vmv),0,vmv)
[1] 1 2 3 4 5 6 0 0 9 10 11 12
> sum(is.na(vmv)) # number of NA values
Exercise
How can we check that the missing values are in the same positions in two vectors?
> x<-c(1,4,6,NA,4,NA)
> y<-c(4,3,2,NA,5,NA)
We write the following instruction:
> all(is.na(x)==is.na(y))
[1] TRUE
If we ignore the missing values, the two vectors are different. Indeed:
> all(x[!is.na(x)]==y[!is.na(y)])
[1] FALSE
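The check from this exercise can be wrapped in a small helper. This is only a sketch: the function name same_na_positions is hypothetical, and the length test is an added precaution so that vectors of different lengths are not silently recycled.

```r
# Hypothetical helper: TRUE when a and b have NA in exactly the same positions
same_na_positions <- function(a, b) {
  length(a) == length(b) && all(is.na(a) == is.na(b))
}

x <- c(1, 4, 6, NA, 4, NA)
y <- c(4, 3, 2, NA, 5, NA)
same_na_positions(x, y)                       # TRUE
same_na_positions(x, c(NA, 3, 2, 1, 5, NA))   # FALSE: first and fourth positions differ
```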
6) Vector functions
One of R’s great strengths is its ability to evaluate functions over entire vectors, thereby
avoiding the need for loops and subscripts. Important vector functions are listed in Table 2.2.
Table 2.2 Vector functions used in R.
Operation               Meaning
max(x)                  maximum value in x
min(x)                  minimum value in x
sum(x)                  total of all the values in x
mean(x)                 arithmetic average of the values in x
median(x)               median value in x
range(x)                vector of min(x) and max(x)
var(x)                  sample variance of x
cor(x,y)                correlation between vectors x and y
sort(x)                 a sorted version of x
rank(x)                 vector of the ranks of the values in x
order(x)                an integer vector containing the permutation to sort x into ascending order
quantile(x)             vector containing the minimum, lower quartile, median, upper quartile, and
                        maximum of x
quantile(x,prob=1:3/4)  gives the three quartiles Q1, Q2 and Q3
cumsum(x)               vector containing the sum of all of the elements up to that point
cumprod(x)              vector containing the product of all of the elements up to that point
cummax(x)               vector of non-decreasing numbers which are the cumulative maxima of the values
                        in x up to that point
cummin(x)               vector of non-increasing numbers which are the cumulative minima of the values
                        in x up to that point
pmax(x,y,z)             vector, of length equal to the longest of x, y or z, containing the maximum
                        of x, y or z for the ith position in each
pmin(x,y,z)             vector, of length equal to the longest of x, y or z, containing the minimum
                        of x, y or z for the ith position in each
colMeans(x)             column means of dataframe or matrix x
colSums(x)              column totals of dataframe or matrix x
rowMeans(x)             row means of dataframe or matrix x
rowSums(x)              row totals of dataframe or matrix x
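A few of the functions in the table are easier to tell apart on a concrete case. The following sketch applies them to two short made-up vectors; the expected results are given as comments.

```r
x <- c(3, 1, 4, 1, 5)   # illustrative values only
y <- c(2, 7, 1, 8, 2)

range(x)      # 1 5
sort(x)       # 1 1 3 4 5
order(x)      # 2 4 1 3 5  (positions of x in ascending order of value)
rank(x)       # 3.0 1.5 4.0 1.5 5.0  (ties share the average rank)
cumsum(x)     # 3 4 8 9 14
cummax(x)     # 3 3 4 4 5
pmax(x, y)    # 3 7 4 8 5  (element-wise maximum)
pmin(x, y)    # 2 1 1 1 2  (element-wise minimum)
```

Note the difference between order(x), which gives positions, and rank(x), which gives the rank of each value in place; x[order(x)] is the same as sort(x).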
The option replace=T allows for sampling with replacement. The vector produced by the sample
function with replace=T is the same length as the vector sampled, but some values are left out at
random and other values, again at random, appear two or more times. In this sample, 10 has been
left out, and there are now three 9s:
sample(y,replace=T)
[1] 9 6 11 2 9 4 6 8 8 4 4 4 3 9 3
In this next case, there are two 10s and only one 9:
sample(y,replace=T)
[1] 3 7 10 6 8 2 5 11 4 6 3 9 10 7 4
More advanced options in sample include specifying different probabilities with which
each element is to be sampled (prob=). For example, if we want to take four numbers at
random from the sequence 1:10 without replacement where the probability of selection (p)
is 5 times greater for the middle numbers (5 and 6) than for the first or last numbers, and
we want to do this five times, we could write
p <- c(1, 2, 3, 4, 5, 5, 4, 3, 2, 1)
x<-1:10
sapply(1:5,function(i) sample(x,4,prob=p))
[,1] [,2] [,3] [,4] [,5]
[1,] 8 7 4 10 8
[2,] 7 5 7 8 7
[3,] 4 4 3 4 5
[4,] 9 10 8 7 6
so the four random numbers in the first trial were 8, 7, 4 and 9 (i.e. column 1).
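Because sample draws at random, the numbers above will differ from run to run. If repeatable results are wanted, one common approach is to fix the random seed with set.seed() first. The sketch below assumes an arbitrary seed value of 42 and reuses p and x from the example.

```r
p <- c(1, 2, 3, 4, 5, 5, 4, 3, 2, 1)   # selection weights from the example above
x <- 1:10

set.seed(42)                  # arbitrary seed, chosen only for illustration
first <- sample(x, 4, prob = p)
set.seed(42)                  # resetting the seed replays the same draw
second <- sample(x, 4, prob = p)

identical(first, second)      # TRUE: same seed, same sample
anyDuplicated(first) == 0     # TRUE: without replacement, no value repeats
```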
8) factor
A factor is a type of vector that encodes a qualitative property (a nominal attribute). Each value is coded
internally by a number and not by the character string that represents it: a factor is stored
internally as a numeric vector with values 1, 2, 3, …, k. The value k is
the number of levels.
Statisticians typically recognise three basic types of variable: numeric, ordinal,
and categorical. Both ordinal and categorical variables take values from
some finite set, but the set is ordered for ordinal variables. For example in an
experiment one might grade the level of physical activity as low, medium, or
high, giving an ordinal measurement. An example of a categorical variable is
hair colour. In R the data type for ordinal and categorical vectors is factor.
The possible values of a factor are referred to as its levels.
To create factors in R, use the function factor(). However, many operations on data in R create
factors by default.
> fac <- factor(c("rouge", "vert", "rouge", "bleu", "vert"))
> fac
[1] rouge vert rouge bleu vert
Levels: bleu rouge vert
> levels(fac)
[1] "bleu" "rouge" "vert"
> as.numeric(fac)
[1] 2 3 2 1 3
The function as.numeric extracts the numerical coding as numbers
1–3 and levels extracts the names of the levels.
Exercise
If x is a factor with n levels and y is a length n vector, what happens if you compute y[x]?
Answer
Factor x gets treated as if it contained the integer codes.
x <- factor(c("Huey", "Dewey", "Louie", "Huey"))
y <- c("blue", "red", "green")
x
y[x]
y[as.numeric(x)]
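To make the answer concrete, the sketch below spells out the results of these commands as comments. The last line is an addition for contrast: indexing by the labels instead of the codes fails here because y has no names.

```r
x <- factor(c("Huey", "Dewey", "Louie", "Huey"))
y <- c("blue", "red", "green")

levels(x)            # "Dewey" "Huey" "Louie"  (levels are sorted alphabetically)
as.integer(x)        # 2 1 3 2                 (the internal integer codes)
y[x]                 # "red" "blue" "green" "red"
y[as.numeric(x)]     # same result as y[x]
y[as.character(x)]   # NA NA NA NA: y has no names to match the labels
```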
Another example
> Sch<-sample(0:1,20,replace=T) # sample of 20 values drawn with replacement from the vector 0:1
> Sch
[1] 0 0 1 0 0 0 1 0 1….
> Sch.f<-factor(Sch,labels=c("private","public"))
> Sch.f
[1] private public public private private public private public public
[10] private public public public public public public private private
[19] private public
Levels: private public
Another example:
> hair <- c("blond", "black", "brown", "brown", "black", "gray","none")
> is.character(hair)
[1] TRUE
> is.factor(hair)
[1] FALSE
> hair <- factor(hair)
> levels(hair)
[1] "black" "blond" "brown" "gray" "none"
> hair <- factor(hair, levels = c("black", "gray", "brown", "blond", "white", "none"))
> table(hair)
hair
black  gray brown blond white  none
    2     1     2     1     0     1
Note the use of the function table to calculate the number of times each level of the factor
appears. table can be applied to other modes of vectors as well as factors.
To create an ordered factor we just include the option ordered = TRUE in the factor command. In
this case it is usual to specify the levels of the factor yourself, as that determines the ordering.
> phys.act <- c("L", "H", "H", "L", "M", "M")
> phys.act <- factor(phys.act, levels = c("L", "M", "H"), ordered = TRUE)
Another possibility:
> phys.act <- as.ordered(factor(phys.act, levels = c("L", "M", "H")))
> is.ordered(phys.act)
[1] TRUE
> phys.act[2] > phys.act[1]
[1] TRUE
Often abbreviations or numerical codes are used to represent the levels of a
factor. You can change the names of the levels using the labels argument.
If you do this then it is good practice to specify the levels too, so you know
which label goes with which level.
> phys.act <- factor(phys.act, levels = c("L", "M", "H"),
+ labels = c("Low", "Medium", "High"), ordered = TRUE)
> table(phys.act)
phys.act
   Low Medium   High
     2      2      2
> which(phys.act == "High")
[1] 2 3
9) list
A list is a special type of vector whose elements can be of
any mode, including the mode list (which makes it possible to nest
lists). The basic function for creating lists is list. It is generally preferable to name
the elements of a list, because extracting elements by their label is simpler and safer.
Elements of a list can be extracted in two ways:
1. with double brackets [[ ]];
2. by their label, with list.name$element.label.
The names of the list elements are accessed with names(lis). The names can also be changed
by assignment: names(lis) <- c("f", "l", "a", "c")
Example:
> lis <- list(firstname = "jean", lastname = "dupond", age = 35, childAges = c(3, 5, 9))
> lis[[4]]
[1] 3 5 9
> lis$age
[1] 35
> names(lis)
[1] "firstname" "lastname" "age" "childAges"
> names(lis) <- c("f", "l", "a", "c")
> lis
$f
[1] "jean"
$l
[1] "dupond"
$a
[1] 35
$c
[1] 3 5 9
Note: lis[1] returns a list made up of a single element, the first one:
> lis[1]
$f
[1] "jean"
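The difference between single and double brackets is easiest to see by checking the class of what comes back. The sketch below rebuilds the original list with its full names so that it runs on its own.

```r
lis <- list(firstname = "jean", lastname = "dupond",
            age = 35, childAges = c(3, 5, 9))

class(lis["age"])       # "list": single brackets keep the list wrapper
class(lis[["age"]])     # "numeric": double brackets return the element itself
lis[["age"]] + 1        # 36
lis$childAges[2]        # 5: second value of the childAges element
length(lis[c("firstname", "age")])   # 2: single brackets can select several elements
```

In particular, lis["age"] + 1 would fail, since arithmetic is not defined on a list.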
10) matrix
> X <- matrix(1:12, nrow=4, ncol=3, byrow=TRUE)
Adding names to the rows and columns
> mat <- matrix(c(3,2,…), nrow=3, dimnames = list(c("A","B","C"), c("a","b","c")))
> dimnames(mat)[[1]] <- letters[1:3] : the row names become a b c
or > rownames(mat) <- c("a","b","c")
> class(X)
[1] "matrix" "array" (R 4.0 and later; earlier versions print only "matrix")
> dim(X)
[1] 4 3
Changing a matrix
> mat[1, 2] <- 5 : changes the value of the element in row 1, column 2
> mat[1, ] <- c(5,6,7) : changes the 1st row
> round(mat, 2) : rounds the elements to 2 decimal places
> mat[1, 2] : extracts the element in row 1, column 2
> mat[, 2] : extracts the 2nd column
> mat[cbind(c(1,2), c(3,5))] : extracts the elements at positions (1,3) and (2,5)
> which(m == 1) : the positions of the elements equal to 1
> which(m == 1, arr.ind=TRUE) : returns the (row, column) indices of the elements equal to 1.
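Indexing with a two-column matrix and the two forms of which can be tried on a small case. This is a sketch on a made-up 3 x 4 matrix m, not the mat used above.

```r
m <- matrix(1:12, nrow = 3)        # 3 x 4 matrix, filled column by column
idx <- cbind(c(1, 2), c(3, 4))     # each row of idx is one (row, column) pair
m[idx]                             # 7 11, i.e. m[1, 3] and m[2, 4]

m[2, ] <- 1                        # set the whole second row to 1
which(m == 1)                      # 1 2 5 8 11: positions counted column by column
which(m == 1, arr.ind = TRUE)      # the same cells as (row, col) pairs
```

Without arr.ind, which treats the matrix as one long vector read column by column, which is why m[2, 1] appears as position 2 and m[2, 2] as position 5.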
Converting a matrix to a vector
> as.vector(X) : the elements are read column by column
[1] 1 4 7 10 2 5 8 11 3 6 9 12
> cbind(vect1,vect2,vect3) : returns a matrix whose columns are the
vectors vect1, vect2 and vect3.
> m2 <- cbind(1,1:4)
> m2
[,1] [,2]
[1,] 1 1
[2,] 1 2
[3,] 1 3
[4,] 1 4
> rbind(vect1,vect2,vect3) : the same, but the vectors become the rows.
To insert a row or a column, we can use rbind and cbind:
> rbind(mat[1:2,], NA, mat[3:4,]) # insert a row
> cbind(mat[,1], NA, mat[,2:3]) # insert a column
Other ways to create a matrix
> vector<-c(1,2,3,4,4,3,2,1)
> V<-matrix(vector,byrow=T,nrow=2)
> V
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 4 3 2 1
Or also with:
> dim(vector)<-c(4,2)
To check that this gives a matrix, we write:
> is.matrix(vector)
[1] TRUE
> vector
[,1] [,2]
[1,] 1 4
[2,] 2 3
[3,] 3 2
[4,] 4 1
Consider the matrix:
X<-matrix(rpois(20,1.5),nrow=4)
We want to add the row names Trial.1, Trial.2, Trial.3, Trial.4. We write
rownames(X)<-paste("Trial.",1:4,sep="")
This gives
> X
        [,1] [,2] [,3] [,4] [,5]
Trial.1    2    2    1    1    3
Trial.2    1    0    2    2    1
Trial.3    2    3    0    2    2
Trial.4    1    0    4    1    1
The same works for the column names.
If X has no row names yet and we want them to be row1, row2, …, we write
rownames(X)<-rownames(X,do.NULL=FALSE)
     [,1] [,2] [,3] [,4] [,5]
row1    2    2    1    1    3
row2    1    0    2    2    1
row3    2    3    0    2    2
row4    1    0    4    1    1
We can also use the function dimnames, whose argument is a list (rows first, columns second):
> dimnames(X)<-list(NULL,paste("drug.",1:5,sep=""))
> X
     drug.1 drug.2 drug.3 drug.4 drug.5
[1,]      2      2      1      1      3
[2,]      1      0      2      2      1
[3,]      2      3      0      2      2
[4,]      1      0      4      1      1
Aggregates on matrices
> rowSums(mat) (or colSums(mat)) : returns the vector of row sums
(or column sums).
> rowMeans(mat) (or colMeans(mat)) : returns the vector of row means (or column means).
Use na.rm = TRUE to ignore NA values.
Matrix product (n,p) * (p,q) = (n,q)
> A %*% B
Scalar product x^T y, where x and y are two vectors with the same number of elements.
> x %*% y
Element-by-element product
> A * B
where A and B are two vectors (or two matrices) of the same dimension.
Transpose of a matrix
> t(X)
Identity matrix
> I <- diag(n)
Inverse of a matrix
> solve(X)
Diagonal of a matrix
> diag(mat) : returns a vector containing the diagonal of mat
> diag(k,n) : returns a diagonal matrix of dimension n whose diagonal elements
are equal to k
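The operations above can be checked against each other on a small case. The sketch below uses a made-up invertible 2 x 2 matrix A and two short vectors.

```r
A <- matrix(c(2, 1, 1, 3), nrow = 2)   # illustrative invertible matrix (determinant 5)
I2 <- diag(2)                          # 2 x 2 identity matrix

all.equal(A %*% solve(A), I2)          # TRUE, up to rounding error
all.equal(t(t(A)), A)                  # TRUE: transposing twice gives A back

x <- c(1, 2, 3)
y <- c(4, 5, 6)
drop(x %*% y)                          # 32: scalar product, 1*4 + 2*5 + 3*6
x * y                                  # 4 10 18: element-by-element product
diag(5, 2)                             # 2 x 2 diagonal matrix with 5 on the diagonal
```

x %*% y returns a 1 x 1 matrix; drop() turns it into a plain number.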
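Exercise: compute the product of two matrices
f<-function(M,N){
  m<-dim(M)[2]; n<-dim(N)[1]   # number of columns of M, number of rows of N
  if (m!=n) stop("non-conformable matrices: ncol(M) must equal nrow(N)")
  p<-dim(M)[1]; q<-dim(N)[2]   # dimensions of the result
  mat<-matrix(NA,p,q)
  for (i in 1:p){
    for (j in 1:q) {
      mat[i,j]<-M[i,]%*%N[,j]  # scalar product of row i of M and column j of N
    }
  }
  return(mat)
}
M<-matrix(1,2,3)
N<-matrix(2,3,2)
f(M,N)   # our implementation
M%*%N    # built-in product, for comparison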
V F
1 3
2 4
Table 1.1
To read this file, type
exemple<-read.table("exemple.txt",header=T)
or
exemple<-read.table("C:/livre/logiciel_R/cours_R/exemple.txt",header=T)
We have assumed that the fields in exemple.txt are separated by spaces (or tabs), as
allowed by the default setting (sep="") for read.table().
Notice header=T specifying that the first line is a header containing
the names of variables contained in the file. Also note that you use forward
slashes (/), not single backslashes (\), in the filename. With doubled backslashes you could have used
exemple<-read.table("C:\\livre\\logiciel_R\\cours_R\\exemple.txt",sep="",header=T)
There are several commonly used variants of read.table. read.csv(file) is
for comma-separated data and is essentially equivalent to
read.table(file, header = TRUE, sep = ","). read.csv2(file) is used to read data
separated by semicolons. read.delim(file) is for tab-delimited data
and is essentially equivalent to read.table(file, header = TRUE, sep = "\t").
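The reading functions can be tried without any existing data file by writing a small file first. In the sketch below the temporary files and their contents (columns a and b) are made up for illustration; they stand in for a real file such as exemple.txt.

```r
tmp <- tempfile(fileext = ".txt")            # temporary file, hypothetical stand-in for a data file
writeLines(c("a b", "1 3", "2 4"), tmp)      # made-up whitespace-separated content with a header line
d1 <- read.table(tmp, header = TRUE)         # default sep = "" splits on any whitespace
str(d1)                                      # 2 observations of 2 variables

csv <- tempfile(fileext = ".csv")
write.csv(d1, csv, row.names = FALSE)        # same data, comma-separated
d2 <- read.csv(csv)                          # header = TRUE is the default for read.csv
identical(dim(d1), dim(d2))                  # TRUE

unlink(c(tmp, csv))                          # remove the temporary files
```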
A dataframe is an object with rows and columns (a bit like a 2-dimensional matrix). The rows
contain different observations from your study, or measurements from your experiment. The
columns contain different variables. The values in the body of the dataframe can be numbers (as
they would be in a matrix), but they could also be text (e.g. the names of factor levels for
categorical variables, like “male” or “female” in a variable called “gender”), they could be
calendar dates (like 23/5/04), or they could be logical variables (like “TRUE” or “FALSE”). Here
is a dataframe with 7 variables, the left-most of which comprises the row names, and other
variables are numeric (Area, Slope, Soil pH and Worm density), categorical (Field Name and
Vegetation) or logical (Damp is either true = T or false = F).
Field.Name Area Slope Vegetation Soil.pH Damp Worm.density
Pound.Hill  4.4     2     Arable     4.5    F            5
Gravel.Pit  2.9     1  Grassland     3.5    F            1
Farm.Wood   0.8    10      Scrub     5.1    T            3
Perhaps the most important thing about analysing your own data properly is getting your
dataframe absolutely right. The expectation is that you will have used a spreadsheet like Excel to
enter and edit the data. Once you have made your dataframe in Excel and corrected all the
inevitable data-entry and spelling errors, then you need to save the dataframe in a file format that
can be read by R. Much the simplest way is to save all your dataframes from Excel as tab-
delimited text files: File / Save As / … then from the “Save as type” options choose “Text (Tab
delimited)”. There is no need to add a suffix, because Excel will automatically add “.txt” to your
file name. This file can then be read into R directly as a dataframe, using the read.table function.
It is important to note that read.table would fail if there were any spaces in any of the variable
names in row 1 of the dataframe (the header row) like Field Name, Soil pH or Worm Density.
We should replace all these spaces by dots “.” before saving the dataframe in Excel (use Edit
/Replace with “ “ replaced by “.”). Now the dataframe can be read into R. There are 3 things to
remember:
• the whole path and file name needs to be enclosed in double quotes: "c:\\abc.txt"
• header=T says that the first row contains the variable names
• row.names=1 says that the first column contains the row names
Think of a name for the data frame (say "worms" in this case).
worms<-read.table("c:\\temp\\worms.txt",header=T,row.names=1)
or
> worms<-read.table("http://www.bio.ic.ac.uk/research/mjcraw/therbook/data/worms.txt",header=T)
To open a data file through a file-selection dialog box, without having to type
its location:
> read.table(file.choose())
> fr <- data.frame(age = c(15,20,16), nom = c("pierre",
"jeanne","karim"),sexe=c("Masculin","Féminin","Masculin"))
> fr
[1] age nom Sexe
1 15 pierre Masculin
2 20 jeanne Féminin
3 16 karim Masculin
On peut ajouter les noms des lignes (ou colonnes) par rownames( ) (ou colnames( )) :
> rownames(fr) = c("I1", "I2","I3")
A column can be extracted by its name or by its number:
> fr[, 1] or fr[,"age"] or fr$age. The result is a vector
[1] 15 20 16. One can also write fr["age"], which returns a data frame with a single variable. To
select the part of the data frame made of the variables nom and sexe, write fr[, c(2,3)] or
fr[,c("nom","sexe")] or fr[c("nom","sexe")].
To add a variable (column) called note to a data frame:
> fr["note"]<-c(14,2,10) or > fr$note<-c(14,2,10)
> fr
age nom sexe note
1 15 pierre Masculin 14
2 20 jeanne Féminin 2
3 16 karim Masculin 10
> mean(age)
Error in mean(age) : object 'age' not found. If we write
> attach(fr)
we can now compute the mean without the $ syntax:
> mean(age)
[1] 17
To obtain the subset of male individuals, write
fr[fr$sexe=="Masculin",] or fr[fr[,3]=="Masculin",] or fr[fr[,"sexe"]=="Masculin",]
The result is
age nom sexe
I1 15 pierre Masculin
I3 16 karim Masculin
head(fr) : returns the first 6 rows of a data frame.
head(fr, 10) : returns the first 10 rows of a data frame.
tail(fr, 10) : returns the last 10 rows of a data frame.
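The operations above can also be written without attach. The sketch below rebuilds the fr data frame from this section and uses with() and subset(), two base R functions that are not used in the text and are shown here only as an alternative.

```r
# The data frame from the text.
fr <- data.frame(age  = c(15, 20, 16),
                 nom  = c("pierre", "jeanne", "karim"),
                 sexe = c("Masculin", "Féminin", "Masculin"))

# with() evaluates an expression inside the data frame, so no attach() is needed.
m <- with(fr, mean(age))

# subset() keeps the rows for which the condition is TRUE.
hommes <- subset(fr, sexe == "Masculin")

print(m)
print(hommes)
print(head(fr, 2))   # first two rows only
```

Avoiding attach keeps column names such as age from lingering on the search path after the analysis is finished.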
Another example
The table iris.f is of type data.frame; it contains data on 3 species of iris.
There are 150 flowers, on each of which 5 characteristics were recorded (size of the data.frame:
150, 5).
> iris.f[c(1:5),]
Sepale.long Sepale.larg Petale.long Petale.larg Espece
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
> dim(iris.f)
[1] 150 5
> iris.f$Sepale.long[1:5]
[1] 5.1 4.9 4.7 4.6 5.0
> iris.f[[2]][1:5]
[1] 3.5 3.0 3.2 3.1 3.6
10 117 3.0 201 2
11 188 8.5 211 7
12 121 4.0 325 1
Rank
The prices themselves are in no particular sequence. The ranks column contains the value
that is the rank of the particular data point (value of Price), where 1 is assigned to the
lowest data point and length(Price) – here 12 – is assigned to the highest data point. So the
first element, Price=325, is the highest value in Price. You should check that there are 11
values smaller than 325 in the vector called Price. Fractional ranks indicate ties. There are
two 188s in Price and their ranks are 8 and 9. Because they are tied, each gets the average
of their two ranks: (8+9)/2 = 8.5.
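A short sketch of the tie handling just described. The Price vector is reconstructed from the ordered listing of houses shown further down in this section; the ties.method argument is a standard option of rank that the text does not discuss.

```r
# Prices in their original row order (rows 1 to 12 of the houses data frame).
Price <- c(325, 201, 157, 162, 164, 101, 211, 188, 95, 117, 188, 121)

# Default behaviour: tied values receive the average of their ranks.
ranks <- rank(Price)
print(ranks)

# Alternative: give every tied value the smallest of its ranks.
ranks_min <- rank(Price, ties.method = "min")
print(ranks_min)
```

The two prices of 188 sit in positions 8 and 11 of Price, so those are the entries that differ between the two results.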
Sort
The sorted vector is very straightforward. It contains the values of Price sorted into ascending
order. If you want to sort into descending order, use the reverse order function rev like
this: y<-rev(sort(x)). Note that sort is potentially very dangerous, because it uncouples
values that might need to be in the same row of the dataframe (e.g. because they are the
explanatory variables associated with a particular value of the response variable). It is bad
practice, therefore, to write x<-sort(x), not least because there is no ‘unsort’ function.
Order
This is the most important of the three functions, and much the hardest to understand on
first acquaintance. The order function returns an integer vector containing the permutation
that will sort the input into ascending order. You will need to think about this one. The
lowest value of Price is 95. Look at the dataframe and ask yourself what is the subscript in
the original vector called Price where 95 occurred. Scanning down the column, you find it
in row number 9. This is the first value in ordered, ordered[1]. Where is the next smallest
value (101) to be found within Price? It is in position 6, so this is ordered[2]. The third
smallest Price (117) is in position 10, so this is ordered[3]. And so on.
This function is particularly useful in sorting dataframes, as explained on p. 113. Using
order with subscripts is a much safer option than using sort, because with sort the values
of the response variable and the explanatory variables could be uncoupled with potentially
disastrous results if this is not realized at the time that modelling was carried out. The
beauty of order is that we can use order(Price) as a subscript for Location to obtain the
price-ranked list of locations:
Location[order(Price)]
[1] Reading Staines Winkfield Newbury
[5] Bracknell Camberley Bagshot Maidenhead
[9] Warfield Sunninghill Windsor Ascot
> houses[order(Price),]
Location Price
9 Reading 95
6 Staines 101
10 Winkfield 117
12 Newbury 121
3 Bracknell 157
4 Camberley 162
5 Bagshot 164
8 Maidenhead 188
11 Warfield 188
2 Sunninghill 201
7 Windsor 211
1 Ascot 325
When you see it used like this, you can see exactly why the function is called order. If you
want to reverse the order, just use the rev function like this:
Location[rev(order(Price))]
[1] Ascot Windsor Sunninghill Warfield
[5] Maidenhead Bagshot Camberley Bracknell
[9] Newbury Winkfield Staines Reading
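The following sketch rebuilds the houses data frame from the listing above and sorts it with order, without attaching it. The decreasing argument is a standard option of order, offered here as an alternative to rev.

```r
# Data reconstructed from the sorted listing; row order follows the row labels.
houses <- data.frame(
  Location = c("Ascot", "Sunninghill", "Bracknell", "Camberley", "Bagshot",
               "Staines", "Windsor", "Maidenhead", "Reading", "Winkfield",
               "Warfield", "Newbury"),
  Price = c(325, 201, 157, 162, 164, 101, 211, 188, 95, 117, 188, 121)
)

# order() returns row numbers, so whole rows stay together when subscripting.
idx <- order(houses$Price)
cheapest_first <- houses[idx, ]

# Descending order without rev().
dearest_first <- houses[order(houses$Price, decreasing = TRUE), ]

print(idx)
print(head(cheapest_first, 3))
print(head(dearest_first, 3))
```

Subscripting the data frame with idx moves Location and Price together, which is exactly what sort applied to a single column cannot do.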
Sorting by several criteria is done simply by having several arguments to
order:
> aa<-data.frame(sexe=c("M","F","M","M","F","M"),age=c(22,54,44,15,41,40))
> aa[order(aa$sexe,aa$age),]
sexe age
5 F 41
2 F 54
4 M 15
1 M 22
6 M 40
3 M 44
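As a small variation on the example above, a numeric key can be negated to sort it in descending order within each group. This is a common idiom rather than something stated in the text, and it only works for numeric keys.

```r
aa <- data.frame(sexe = c("M", "F", "M", "M", "F", "M"),
                 age  = c(22, 54, 44, 15, 41, 40))

# Sort by sexe, then by increasing age within each sexe.
by_sexe_age <- aa[order(aa$sexe, aa$age), ]

# Sort by sexe, then by decreasing age within each sexe.
by_sexe_age_desc <- aa[order(aa$sexe, -aa$age), ]

print(by_sexe_age)
print(by_sexe_age_desc)
```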
One of the most important and useful vector functions to master is tapply. The ‘t’ stands
for ‘table’ and the idea is to apply a function to produce a table from the values in the
vector, based on one or more grouping variables (often the grouping is by factor levels).
This sounds much more complicated than it really is:
> data<-read.table("http://www.bio.ic.ac.uk/research/mjcraw/therbook/data/Daphnia.txt",header=T)
or
> data<-read.table("c:\\temp\\daphnia.txt",header=T)
read.table reads a file in table format and turns it into a data frame.
> class(data)
[1] "data.frame"
> attach(data)
> names(data)
[1] "Growth.rate" "Water" "Detergent" "Daphnia"
Once attach has been used, we can write
> Growth.rate instead of > data$Growth.rate
> head(data)
Growth.rate Water Detergent Daphnia
1 2.919086 Tyne BrandA Clone1
2 2.492904 Tyne BrandA Clone1
3 3.021804 Tyne BrandA Clone1
4 2.350874 Tyne BrandA Clone2
5 3.148174 Tyne BrandA Clone2
6 4.423853 Tyne BrandA Clone2
The response variable is Growth.rate and the other three variables are factors. Suppose we want
the mean growth rate for each detergent:
> tapply(Growth.rate,Detergent,mean)
BrandA BrandB BrandC BrandD
3.88 4.01 3.95 3.56
This produces a table with four entries, one for each level of the factor called Detergent.
To produce a two-dimensional table we put the two grouping variables in a list. Here we
calculate the median growth rate for water type and daphnia clone:
tapply(Growth.rate,list(Water,Daphnia),median)
Clone1 Clone2 Clone3
Tyne 2.87 3.91 4.62
Wear 2.59 5.53 4.30
The first variable in the list creates the rows of the table and the second the columns.
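The Daphnia file may not be available locally, so the sketch below repeats the same two tapply patterns on iris and warpbreaks, two data sets that ship with R. These data sets stand in for the Daphnia data and are not part of the original example.

```r
# One grouping variable: mean sepal length for each species.
means <- tapply(iris$Sepal.Length, iris$Species, mean)
print(means)

# Two grouping variables in a list: the first gives the rows, the second the columns.
medians <- tapply(warpbreaks$breaks,
                  list(warpbreaks$wool, warpbreaks$tension),
                  median)
print(medians)
```

Writing iris$Sepal.Length rather than attaching the data frame keeps the example independent of whatever else is on the search path.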
apply (X, MARGIN, FUN)
> apply (mat, 1, sum) : returns a vector with the row sums
> apply (mat, 1, mean) : returns a vector with the row means
> apply (mat, c(1,2), function (x) {ifelse(x != 0, 1, 0)}) : returns a matrix of the same
dimensions whose elements are 1 or 0 according to whether the value is non-zero or zero.
MARGIN = 1 : indicates the rows
MARGIN = 2 : indicates the columns
MARGIN = c(1,2) : indicates the rows and the columns
FUN : the function to apply
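A minimal sketch of the three MARGIN settings on a small matrix. The matrix values are arbitrary and chosen only so that the results are easy to check by hand.

```r
# A 2 x 3 matrix filled column by column: rows are (1, 3, 5) and (2, 4, 6).
mat <- matrix(1:6, nrow = 2)

row_sums <- apply(mat, 1, sum)   # one value per row
col_sums <- apply(mat, 2, sum)   # one value per column

# MARGIN = c(1, 2): the function is applied to every element.
# mat - 1 contains a single zero, in position [1, 1].
nonzero <- apply(mat - 1, c(1, 2), function(x) ifelse(x != 0, 1, 0))

print(row_sums)
print(col_sums)
print(nonzero)
```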
> identical (a,b)
[1] TRUE (if a and b are exactly equal) or FALSE (otherwise)
sapply
Use sapply to map a function to each column of a data frame. For example the provided iris data
set:
> sapply(iris, class) # Apply class to columns of iris
> sapply(iris[1:4], mean) # Apply mean to columns 1:4
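The two calls can be run as they stand on the iris data set supplied with R. The sketch stores the results so that they can be inspected; the object names are chosen here for illustration.

```r
# sapply() takes its arguments in parentheses, like any other function.
classes <- sapply(iris, class)        # one class name per column
col_means <- sapply(iris[1:4], mean)  # the four numeric columns only

print(classes)
print(col_means)
```

The fifth column, Species, is a factor, which is why mean is applied to columns 1 to 4 only.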
Here is what the file bonjour.txt looks like:
bonjour monsieur
Note that cat does not automatically write a newline after the expressions. If you want a newline
you must explicitly include the string \n.
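A brief sketch of the newline behaviour of cat when writing to a file. It uses a temporary file rather than bonjour.txt, and the second line of text is invented for the example.

```r
tmp <- tempfile(fileext = ".txt")

# cat() writes exactly what it is given, so the newline must be explicit.
cat("bonjour monsieur\n", file = tmp)

# append = TRUE adds to the file instead of overwriting it.
cat("au revoir\n", file = tmp, append = TRUE)

lines <- readLines(tmp)
print(lines)
```

Without the "\n" in the first call, the second string would have been written on the same line as the first.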
R provides a number of ways to read data from a file, the most flexible of
which is the scan function. We use scan to read a vector of values from a file.
For this example the file ba.txt was created beforehand using a text editor,
and is stored in working directory ../coursR.
321 543
432 543
> data<-scan(file="ba.txt")
> data
[1] 321 543 432 543
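The same steps can be tried without preparing ba.txt by hand. In this sketch the file is created in a temporary location with writeLines; reshaping the result with matrix is an extra step that the text does not cover.

```r
# Recreate the contents of ba.txt in a temporary file.
tmp <- tempfile(fileext = ".txt")
writeLines(c("321 543", "432 543"), tmp)

# scan() reads the values row by row into a single numeric vector.
values <- scan(file = tmp, quiet = TRUE)
print(values)

# Optional: restore the two-column layout of the file.
m <- matrix(values, ncol = 2, byrow = TRUE)
print(m)
```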
> name <- c('Ira A', 'David A', 'Todd A')
> height <- c(5 + 4 / 12, 6 + 11 / 12, 5 + 11 / 12)
Next, we combine them into a data frame
> faculty <- data.frame(name, height)
We save the data frame to a file
> save(faculty, file = 'faculty.rda')
> load('faculty.rda')
> faculty
name height
1 Ira A 5.333333
2 David A 6.916667
3 Todd A 5.916667
It is therefore possible (and strongly advisable) to create several files with the .rda extension: one
for each project being worked on. These .rda files should be created in
separate, appropriately named folders. For example, suppose we are working on two different
statistical projects, one about cars and the other about the climate:
we could create a folder named Automobile containing a file auto.rda and another
folder named Climat containing a file named climat.rda (each holding the R objects
that belong to the corresponding study).
The save( ) function writes a workspace file, and the
load( ) function must be used to load an existing one.
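A minimal sketch of the save and load round trip, using the faculty data frame from above. The file is written to R's temporary directory so that nothing is left behind; in a real project the path would point to the project folder.

```r
name <- c("Ira A", "David A", "Todd A")
height <- c(5 + 4 / 12, 6 + 11 / 12, 5 + 11 / 12)
faculty <- data.frame(name, height)

# Save the object, remove it from the workspace, then restore it.
path <- file.path(tempdir(), "faculty.rda")
save(faculty, file = path)
rm(faculty)

# load() recreates the objects and returns their names (invisibly).
loaded <- load(path)
print(loaded)
print(faculty)
```

Note that load restores objects under the names they had when saved, so it can silently overwrite an object of the same name in the current workspace.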
16) Dates
as.Date converts a character string into a "date" object.
x<-"2007-10-17"
d<-as.Date(x, "%Y-%m-%d")
> x;d
[1] "2007-10-17"
[1] "2007-10-17"
> str(x)
chr "2007-10-17"
> str(d)
Date[1:1], format: "2007-10-17"
The default format has year, then month, then day of month
> dd <- as.Date(c("2003-08-24","2003-11-23","2004-02-22","2004-05-03"))
> diff(dd)
Time differences in days
[1] 91 91 71
> as.Date("1/1/1960", format="%d/%m/%Y")
[1] "1960-01-01"
> as.Date("1:12:1960",format="%d:%m:%Y")
[1] "1960-12-01"
Summary of the syntax:
%d day of the month (01–31)
%m month (01–12)
%Y year (4 digits)
%y year (2 digits), to be avoided!
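The sketch below shows the format codes in both directions and illustrates why %y is best avoided. The century rule in the comments is the one described in R's strptime documentation; the example dates are chosen here for illustration.

```r
# Parsing with an explicit format, then printing in that same format.
d <- as.Date("17/10/2007", format = "%d/%m/%Y")
back <- format(d, "%d/%m/%Y")

# Subtracting two dates gives a time difference in days.
gap <- as.numeric(as.Date("2004-05-03") - as.Date("2004-02-22"))

# Two-digit years: on input, 00 to 68 are read as 20xx and 69 to 99 as 19xx.
y68 <- format(as.Date("01/01/68", format = "%d/%m/%y"), "%Y")
y69 <- format(as.Date("01/01/69", format = "%d/%m/%y"), "%Y")

print(back)
print(gap)
print(c(y68, y69))
```

A birth date written as 01/01/68 is therefore read as a date in the future, which is easy to miss in a large file.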
17) Exercises
Exercise 1
a) Write an R expression that creates the following list:
[[1]]
[1] 1 2 3 4 5
$data
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
[[3]]
[1] 0 0 0
$test
[1] FALSE FALSE FALSE FALSE
b) Extract the names of the list elements.
c) Find the mode and the length of the fourth element of the list.
d) Extract the dimensions of the second element of the list.
e) Extract the second and third elements of the second element of the
list.
f) Replace the third element of the list by the vector 3:8.
Correction
Let x be the name of the list.
a) > x<- list(1:5, data = matrix(1:6, 2, 3), numeric(3),
+ test = logical(4))
or
> x<-list(1:5,data=matrix(1:6,nrow=2,ncol=3),rep(0,3),test=rep(FALSE,4))
b) > names(x)
c) > mode(x$test) or > mode(x[[4]])
> length(x$test)
d) > dim(x$data)
e) > x[[2]][c(2, 3)] or x$data[c(2, 3)]
f) > x[[3]] <- 3:8
Exercise 2
Let obs be a vector containing the following values:
> obs
[1] 3 9 2 2 1 1 7 13 9 14 4 16 6 7 4 3
[17] 9 8 3 12
Write an R expression that extracts the following elements.
a) The second element of the sample.
b) The first five elements of the sample.
c) The elements strictly greater than 14.
d) All elements except those in positions 6, 10 and 12.
Correction
a) > obs[2]
b) > obs[1:5]
c) > obs[obs > 14]
d) > obs[-c(6, 10, 12)]
Exercise 3
Let mat be a 7×10 matrix generated at random with
> (mat <- matrix(sample(1:100, 70), 7, 10))
Write an R expression that returns each of the items requested below.
a) Element (4,3) of the matrix.
b) The contents of the sixth row of the matrix.
c) The first and fourth columns of the matrix (simultaneously).
d) The rows of the matrix whose first element is greater than 50.
Correction
> mat
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 4 81 89 76 77 35 41 3 96 26
[2,] 2 60 11 93 64 68 75 17 9 73
[3,] 98 90 28 46 24 69 1 84 61 8
[4,] 22 6 13 29 78 47 19 30 38 85
[5,] 72 95 52 94 79 82 48 10 57 18
[6,] 44 40 39 21 83 43 14 33 91 45
[7,] 12 86 23 49 67 65 5 97 55 34
a) > mat[4, 3]
b) > mat[6, ]
c) > mat[, c(1, 4)]
d) > which(mat[, 1] > 50)
[1] 3 5
> mat[c(3,5),]
Alternative
> mat[mat[, 1] > 50, ]
Exercise 4
Using only the functions rep, seq and c, generate the following sequences.
a) 0 6 0 6 0 6
b) 1 4 7 10
c) 1 2 3 1 2 3 1 2 3 1 2 3
d) 1 2 2 3 3 3
e) 1 1 1 2 2 3
f) 1 5.5 10
g) 1 1 1 1 2 2 2 2 3 3 3 3
Correction
a) > rep(c(0, 6), 3)
b) > seq(1, 10, by = 3)
c) > rep(1:3, 4)
d) > rep(1:3, 1:3)
e) > rep(1:3, 3:1)
f) > seq(1, 10, length = 3)
g) > rep(1:3, each = 4)
Exercise 5
Generate the following sequences of numbers using only the : operator and the rep function,
that is, without using the seq function.
a) 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2
b) 1 3 5 7 9 11 13 15 17 19
c) -2 -1 0 1 2 -2 -1 0 1 2
d) -2 -2 -1 -1 0 0 1 1 2 2
e) 10 20 30 40 50 60 70 80 90 100
Correction
a) > 11:20/10
b) > 2 * 0:9 + 1
c) > rep(-2:2, 2)
d) > rep(-2:2, each = 2)
e) > 10 * 1:10
Exercise 6
Using the apply command, write R expressions that would replace
the following functions.
a) rowSums
b) colSums
c) rowMeans
d) colMeans
Correction
Let mat be a matrix.
a) > apply(mat, 1, sum)
b) > apply(mat, 2, sum)
c) > apply(mat, 1, mean)
d) > apply(mat, 2, mean)
Exercise 7
Without using the factorial function, generate the sequence 1!, 2!, ..., 10!
Correction
> cumprod(1:10)
Exercise 8
Simulate a sample (x1, x2, x3, ..., x20) with the sample function.
Write an R expression that returns or computes each of the
results requested below.
a) The first five elements of the sample.
b) The maximum value of the sample.
c) The mean of the first five elements of the sample.
d) The mean of the last five elements of the sample.
Correction
> x<-sample(1:100, 20)
a) > x[1:5]
> head(x, 5)
b) > max(x)
c) > mean(x[1:5])
> mean(head(x, 5))
d) > mean(x[16:20])
> mean(x[(length(x) - 4):length(x)])
> mean(tail(x, 5))
> mean(rev(x)[1:5])
Exercise 9
Simulate a 7×10 matrix mat, then write R expressions that
carry out the tasks requested below.
a) Compute the sum of the elements of each row of the matrix.
b) Compute the mean of the elements of each column of the
matrix.
c) Compute the maximum value of the submatrix formed by the first three
rows and the first three columns.
d) Extract all the rows of the matrix whose mean is
greater than 7.
Correction
> mat<-matrix(rnorm(70),7,10)
a) > rowSums(mat)
b) > colMeans(mat)
c) > max(mat[1:3, 1:3])
d) > mat[rowMeans(mat) > 7, ]
Exercise 10
Write a function that computes the means and variances of a vector with two
components (each component being a sample).
Correction
desc<-function (x,y){
moyenne<-numeric(2)
var<-numeric(2)
moyenne[1]<-mean(x)
moyenne[2]<-mean(y)
var[1]<-var(x)
var[2]<-var(y)
cat("Moyennes",moyenne,"\n")
cat("Variances",var,"\n")
}
> desc(rnorm(32),rnorm(43))
Moyennes 0.1235007 0.08347782
Variances 1.045373 1.250843
Exercise 11
Write a function that computes the means and variances of a vector with an
arbitrary number of components.
Correction
many.means<-function (...){
data <- list(...)
n<-length(data)
means<-numeric(n)
vars<-numeric(n)
for (i in 1:n){
means[i]<-mean(data[[i]])
vars[i]<-var(data[[i]])
}
cat("Moyennes",round(means,3),"\n")
cat("Variances",round(vars,3),"\n")
}
Example:
> many.means(rnorm(100,4,2),rnorm(43))
Moyennes 3.982 -0.012
Variances 4.408 0.944
Exercise 12
Write a function that takes a set of values as its argument and returns a
list containing the number of values, the mean and the standard deviation.
Correction
desc <- function (x){
ans <- list ()
ans$taille <- length (x)
ans$moyenne <- mean (x)
ans$ecarttype <-sd(x)
print (ans)
}
Example
> desc(rnorm(32))
$taille
[1] 32
$moyenne
[1] -0.04655557
$ecarttype
[1] 0.9266683
Exercise 13
Write a function centrer() that "centres" the variables of the data frame data read earlier with
read.table (in other words, that subtracts from each element of a column the mean of that column).
This can be done in two ways:
– by computing the column means with mean
– by using the function scale().
(Remark: one may check that the variables to be centred are indeed quantitative before
applying the transformation.)
Correction
1) centre<-function(DR)
{
aa<-data.frame()
d<-dim(DR)[2]
for (i in 1:d)
if (is.numeric(DR[,i]))
DR[,i]<-DR[,i]-mean(DR[,i])
aa<-DR
print(aa)}
2) cen<-function(DR)
{
aa<-data.frame()
d<-dim(DR)[2]
for (i in 1:d)
if (is.numeric(DR[,i]))
DR[,i]<-scale(DR[,i],scale=FALSE)
aa<-DR
print(aa)}
Exercise 14
Compute the probability function of the binomial distribution
Correction
binome <- function(n,p) factorial(n)/(factorial(p)*
+ factorial (n-p))
Alternative: choose(n,p)
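Both expressions above give the binomial coefficient only. As a hedged sketch, the full probability function can be built from choose and checked against dbinom, the function R itself provides for this distribution. The name binom_pmf and the argument names k, n and p are chosen here for illustration and do not come from the correction.

```r
# P(X = k) for X ~ Binomial(n, p): coefficient times p^k (1 - p)^(n - k).
binom_pmf <- function(k, n, p) {
  choose(n, k) * p^k * (1 - p)^(n - k)
}

k <- 0:5
probs <- binom_pmf(k, n = 5, p = 0.3)

print(probs)
print(all.equal(probs, dbinom(k, size = 5, prob = 0.3)))
print(sum(probs))   # the probabilities add up to 1
```

Note that in this sketch p is the success probability, whereas in binome(n,p) above p is the number of successes.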
Exercise 15
Describe how to insert a value between two elements of a vector at a
given position by using the append function (use the help system to find
out). Without append, how would you do it?
Answer
1)
> x<-1:5
> x<-append(x,9,after=3)
> x
[1] 1 2 3 9 4 5
2)
> x<-1:5
> x<-c(x[1:3],9,x[4:length(x)])
> x
[1] 1 2 3 9 4 5
Exercise 16
1) Write the instruction that simulates a sample of size 100 drawn from the
Poisson distribution with parameter 2.2.
2) Write the instruction that produces the frequency table.
3) Write the instruction that returns the number of 0s and the number of 3s, both at once.
Correction
1) > x<-rpois(100,2.2)
2) > table(x)
x
0 1 2 3 4 5 8
14 22 23 20 15 5 1
3) > table(x)[c(1,4)]
x
0 3
14 20
Alternative
> table(x)[c("0","3")]
x
0 3
14 20
Exercise 17
Write the instructions that compute the variance of x.
> x = c(2,3,5,7,11)
Correction
> x = c(2,3,5,7,11)
> xbar = mean(x)
> x-xbar # the difference
[1] -3.6 -2.6 -0.6 1.4 5.4
> (x-xbar)^2 # the squared difference
[1] 12.96 6.76 0.36 1.96 29.16
> sum((x-xbar)^2) # sum of squared differences
[1] 51.2
> n = length(x)
> n
[1] 5
> sum((x-xbar)^2)/(n-1) # same value as var(x)
[1] 12.8
Exercise 18
Generate a factor of 20 elements whose values are chosen at random from "oui", "non"
and "peut-être". Try the functions table() and levels() on this vector.
Correction
> x<- factor(sample(c("oui","non","peut-être"), 20, replace=T))
> table(x)
x
non oui peut-être
13 14 13
> levels(x)
[1] "non" "oui" "peut-être"
Exercise 19
1. Generate a random 10×5 matrix M (with real values between 0 and 1)
(use the runif function).
2. Find the number of elements greater than 0.9.
3. Replace the elements of M that are less than 0.5 by 0.
4. Check its type and the nature of its elements.
5. Create a data frame from M. Check the result.
6. Extract the vector corresponding to the third column of M.
7. Extract the list corresponding to the second row of M.
Correction
1. > M<- matrix(runif(50), nrow=10)
2. > length(M[M>0.9]) or sum(M>0.9)
3. > M[M<0.5] <- 0
4. > mode(M); typeof(M)
> class(M)
5. > MDF <- as.data.frame(M) or MDF<-data.frame(M)
6. > M[,3]
7. > M[2,]
Exercise 20
Generate a data frame such that:
- the number of individuals (rows) is 5;
- the variables (columns) are named "Sexe", "Âge", then "Note 1", "Note 2";
- the sex of each individual is chosen at random from "Masculin" and "Féminin";
- the marks are generated at random between 0 and 20;
- the age of an individual is generated at random between 18 and 24.
1) Extract the subset of the data corresponding to the variables "Note 1", "Note 2".
2) Extract the subset of the data corresponding to the girls.
Correction
sexe <- sample(c("M","F"), 5, replace=T)
age <- sample(18:24, 5, replace=T)
note1 <- sample(0:20, 5, replace=T)
note2 <- sample(0:20, 5, replace=T)
DF <- data.frame(Age=age, Sexe=sexe, note1=note1, note2=note2)
1) > DF[,c("note1","note2")] or DF[,c(3,4)]
2) > DF[DF$Sexe=="F",] or DF[DF[,2]=="F",]
Exercise 21
What do these instructions return?
x<-as.factor(c("apple", "apple", "orange", "apple", "orange"))
as.numeric(x)
levels(x)
x
Exercise 22
Write a function that computes the median of a vector x with n distinct components.
Recall that if n is even the median is the mean of the two middle points, and that if n is odd the
median is the middle point.
Hint: first sort the n data values in increasing order.
Correction
f<-function(x){
n<-length(x)
x<-sort(x)
if ( n%%2==0) X<-(x[n/2]+x[n/2+1])/2
else X<-x[(n+1)/2]
return(X)
}
Exercise 23
Consider the function y = f(x) defined by:
$$
f(x) = \begin{cases} -x^3 & \text{if } x \le 0 \\ x^2 & \text{if } x \in \,]0,1] \\ \sqrt{x} & \text{if } x > 1 \end{cases}
$$
Write an R function that returns the value of y for any value of x.
Correction
f<-function(x) ifelse(x<=0,-x^3,ifelse(x>1,sqrt(x),x^2))
f<-function(x){
if (x<=0) X<- -x^3
else
if (x>1) X<-sqrt(x)
else X<-x^2
return(X)
}
Exercise 24
Write a function that computes the product of two arbitrary matrices
Correction
f<-function(M,N){
m<-dim(M)[2];n<-dim(N)[1]
if (m!=n) stop("non-conformable matrices: ncol(M) must equal nrow(N)")
p<-dim(M)[1];q<-dim(N)[2]
mat<-matrix(NA,p,q)
for (i in 1:p){
for (j in 1:q) {
mat[i,j]<-M[i,]%*%N[,j]
}
}
return(mat)
}
M<-matrix(1,2,3)
N<-matrix(2,3,2)
f(M,N)
M%*%N