Bachelor of Science (Honours) in Data Science and Artificial Intelligence
DA112: Introduction to R
Course Instructor: Ayon Borthakur
TAs: Rahul Goswami, Tenzin Dawa, Kumar Sanu
Optional Note: Usage of this material for teaching and education is considered fair use. However, posting
images to a website or commercial usage could be considered copyright infringement. In such cases, you are required to
reach out to the author(s) or the Institute for permission before use.
Week 8
Dataframes
Learning Objectives
01 What is a Dataframe?
02 How to create Dataframes
03 Dataframe Attributes
04 Replacing/Deleting/Coercing in Dataframe
05 Apply functions over a dataframe
Introduction
Introduction to Dataframes
● Dataframes are one of the most commonly used data structures in R
● Dataframes are two-dimensional, table-like objects with rows and columns
● Each column of a dataframe can store a different type of data
● Internally, dataframes are lists of vectors of equal length
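The last point can be checked directly in the console. The following is a minimal sketch; the data frame `demo` and its values are made up for illustration and are not part of the course material.

```r
# Illustrative data frame; the name `demo` and its values are assumptions
demo <- data.frame(id = 1:3, label = c("a", "b", "c"))

is.list(demo)    # TRUE: a data frame is a list underneath
typeof(demo)     # "list"
class(demo)      # "data.frame"
length(demo)     # 2: the number of list elements, i.e. columns
lengths(demo)    # every column holds the same number of values
```

Because each column is a separate list element, each column can keep its own type.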
Creating Dataframes
● Dataframes can be created using the data.frame() function
# Create a dataframe
df <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age = c(25, 30, 35),
  married = c(TRUE, FALSE, TRUE)
)
# Print the dataframe
print(df)
##      name age married
## 1   Alice  25    TRUE
## 2     Bob  30   FALSE
## 3 Charlie  35    TRUE
note : An important change that happened in R 4.0.0 is that stringsAsFactors is now FALSE by default. This
means that character vectors are no longer automatically converted to factors when creating a data frame. If
you want to convert a character vector to a factor, you need to do it explicitly using the factor() function.
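As a small illustration of the note above, a character column can be converted explicitly with factor(). This is only a sketch; the data frame `people` and the city values are hypothetical.

```r
# Hypothetical example data; names and values are chosen for illustration only
people <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  city = c("Guwahati", "Delhi", "Guwahati")
)
class(people$city)    # "character" in R >= 4.0.0

# Convert the column to a factor explicitly
people$city <- factor(people$city)
class(people$city)    # "factor"
levels(people$city)   # levels are sorted alphabetically
```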
Column names
The columns of a dataframe are always named. What will happen if we try to create a dataframe without
column names?
# Create a dataframe without column names
df <- data.frame(
  c("Alice", "Bob", "Charlie"),
  c(25, 30, 35),
  c(TRUE, FALSE, TRUE)
)
# Print the dataframe
print(df)
##   c..Alice....Bob....Charlie.. c.25..30..35. c.TRUE..FALSE..TRUE.
## 1                        Alice            25                 TRUE
## 2                          Bob            30                FALSE
## 3                      Charlie            35                 TRUE
note : If you create a dataframe without column names, R will automatically generate column names for you.
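The generated names are hard to read, so one option is to assign proper names afterwards. A minimal sketch, assuming the same unnamed data frame as above (stored here under the hypothetical name `df_unnamed`):

```r
# Same unnamed data frame as in the example above
df_unnamed <- data.frame(
  c("Alice", "Bob", "Charlie"),
  c(25, 30, 35),
  c(TRUE, FALSE, TRUE)
)

# Replace the auto-generated names with readable ones
names(df_unnamed) <- c("name", "age", "married")
names(df_unnamed)
```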
Dataframe over a matrix
● Dataframes are similar to matrices, but with some important differences
● A matrix holds a single type of data, whereas each column of a dataframe can have its own type
(df <- data.frame(
  name = c("Petra", "Jochen", "Alexander"), # character
  age = c(35L, 21L, 12L),                   # integer
  height = c(1.72, 1.65, 1.39),             # numeric
  austrian = c(FALSE, TRUE, TRUE),          # logical
  stringsAsFactors = FALSE                  # default since R 4.0.0
))
##        name age height austrian
## 1     Petra  35   1.72    FALSE
## 2    Jochen  21   1.65     TRUE
## 3 Alexander  12   1.39     TRUE
Let's take a look at the structure
str(df)
## 'data.frame': 3 obs. of 4 variables:
## $ name : chr "Petra" "Jochen" "Alexander"
## $ age : int 35 21 12
## $ height : num 1.72 1.65 1.39
## $ austrian: logi FALSE TRUE TRUE
Dataframe Attributes
Several attributes and properties of a dataframe can be queried or set. The most
important ones are:
● row.names : The row names of the dataframe
● colnames : The column names of the dataframe
● names : The column names of the dataframe (same as colnames for a dataframe)
● class : The class of the dataframe
● dim : The dimensions of the dataframe
● nrow : The number of rows in the dataframe
● ncol : The number of columns in the dataframe
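To see which of these are actually stored on the object and which are computed on request, attributes() can be inspected. This is a hedged sketch on a made-up data frame `demo`:

```r
# Illustrative data frame; name and values are assumptions
demo <- data.frame(x = 1:3, y = c("a", "b", "c"))

# Attributes stored on the object: names, class and row.names
names(attributes(demo))

# The dimensions are not stored as an attribute; dim() derives them
is.null(attr(demo, "dim"))   # TRUE
dim(demo)                    # 3 rows, 2 columns
```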
row.names and colnames
# Setting row names
row.names(df) <- c("Row 1", "Row 2", "Row 3")
# Setting column names
colnames(df) <- c("Name", "Age", "Height", "Austrian")
df
##            Name Age Height Austrian
## Row 1     Petra  35   1.72    FALSE
## Row 2    Jochen  21   1.65     TRUE
## Row 3 Alexander  12   1.39     TRUE
names and class
# Getting the column names
names(df)
## [1] "Name" "Age" "Height" "Austrian"
# Getting the class of the dataframe
class(df)
## [1] "data.frame"
dim, nrow and ncol
# Getting the dimensions of the dataframe
dim(df)
## [1] 3 4
# Getting the number of rows in the dataframe
nrow(df)
## [1] 3
# Getting the number of columns in the dataframe
ncol(df)
## [1] 4
Subsetting Dataframes
Subsetting is one of the vital operations in data manipulation, and since
dataframes are the most used data structure, we must understand the different
ways in which a dataframe can be subset in R. The examples that follow use df
with its original lowercase column names and default row names.
List style subsetting
# List style subsetting
df[1] # returns a subdataframe with the first column
##        name
## 1     Petra
## 2    Jochen
## 3 Alexander
List style subsetting continued
# List style subsetting
df['name'] # returns a subdataframe with the column named 'name'
##        name
## 1     Petra
## 2    Jochen
## 3 Alexander
List style subsetting continued
# List style subsetting
df[c('name', 'age')] # returns a subdataframe with the columns 'name' and 'age'
##        name age
## 1     Petra  35
## 2    Jochen  21
## 3 Alexander  12
List style subsetting continued
# List style subsetting
df$name # returns a vector with the column named 'name'
## [1] "Petra" "Jochen" "Alexander"
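The difference between single brackets, double brackets and `$` can be summarised in a short sketch. The data frame `demo` below is a reduced copy of the example data, created only for this illustration.

```r
# Reduced copy of the example data (illustrative)
demo <- data.frame(
  name = c("Petra", "Jochen", "Alexander"),
  age = c(35L, 21L, 12L),
  stringsAsFactors = FALSE
)

class(demo["name"])     # "data.frame": single brackets keep the data frame
class(demo[["name"]])   # "character": double brackets extract the column itself
identical(demo$name, demo[["name"]])   # TRUE: $ behaves like [[ ]] here
```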
List style subsetting continued
# List style subsetting by logical vector
# The vector is shorter than the number of columns, so it is recycled to
# c(TRUE, FALSE, TRUE, TRUE) and columns 1, 3 and 4 are returned
df[c(TRUE, FALSE, TRUE)]
##        name height austrian
## 1     Petra   1.72    FALSE
## 2    Jochen   1.65     TRUE
## 3 Alexander   1.39     TRUE
Matrix style subsetting
# Matrix style subsetting
df[1, ] # returns a subdataframe with the first row
##    name age height austrian
## 1 Petra  35   1.72    FALSE
Matrix style subsetting continued
# Matrix style subsetting
df[2:3, 1:3] # returns a subdataframe with the second and third rows and the first three columns
##        name age height
## 2    Jochen  21   1.65
## 3 Alexander  12   1.39
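One detail of matrix style subsetting worth knowing: selecting a single column simplifies the result to a vector unless drop = FALSE is given. A minimal sketch on an illustrative data frame `demo`:

```r
# Illustrative data frame; name and values are assumptions
demo <- data.frame(
  name = c("Petra", "Jochen", "Alexander"),
  age = c(35L, 21L, 12L)
)

age_vec <- demo[, "age"]                 # simplified to an integer vector
age_df  <- demo[, "age", drop = FALSE]   # stays a one-column data frame

class(age_vec)
class(age_df)
```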
The subset function
The subset function is a powerful function for subsetting dataframes
Usage : subset(x, subset, select, drop = FALSE, …)
● x : dataframe (or an object coercible to a dataframe)
● subset : logical expression indicating elements or rows to keep; missing
values are taken as false
● select : expression indicating columns to select from x
● drop : passed on to the '[' indexing operator
The subset function continued
A demo dataframe to demonstrate the subset function
(df <- data.frame(
  name = c("Petra", "Jochen", "Alexander"), # character
  age = c(35L, 21L, 12L),                   # integer
  height = c(1.72, 1.65, 1.39),             # numeric
  austrian = c(FALSE, TRUE, TRUE),          # logical
  stringsAsFactors = FALSE                  # default since R 4.0.0
))
##        name age height austrian
## 1     Petra  35   1.72    FALSE
## 2    Jochen  21   1.65     TRUE
## 3 Alexander  12   1.39     TRUE
Subset the rows with age greater than 18 and keep the columns name and height
subset(df, age > 18, select = c(name, height))
##     name height
## 1  Petra   1.72
## 2 Jochen   1.65
Students (Try it out): It can also be done using the df[] notation
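One possible way to do this with bracket notation is sketched below; it is not the only solution. The data frame is re-created under the hypothetical name `demo` so the sketch runs on its own.

```r
# Copy of the demo data (illustrative name)
demo <- data.frame(
  name = c("Petra", "Jochen", "Alexander"),
  age = c(35L, 21L, 12L),
  height = c(1.72, 1.65, 1.39),
  austrian = c(FALSE, TRUE, TRUE)
)

res_subset <- subset(demo, age > 18, select = c(name, height))

# Bracket notation: logical row condition, then column names
keep_rows   <- demo$age > 18
res_bracket <- demo[keep_rows, c("name", "height")]

all.equal(res_subset, res_bracket)   # compares the two results
```

Note that subset() treats missing values in the condition as FALSE, while bracket indexing with NA in the condition produces rows of NA, so the two only agree when the condition contains no missing values.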
The subset function continued
Subset only those who are not Austrian from the dataframe
subset(df, !austrian)
##    name age height austrian
## 1 Petra  35   1.72    FALSE
The subset function continued
The return value of subset() is a data frame if the first argument x is of class
data frame, except if we select a single column and set drop = TRUE. In this case we
only get a vector, in the example below a logical vector.
subset(df, age >= 18, austrian, drop = TRUE)
## [1] FALSE TRUE
Replacing/Deleting/Adding Variables
To delete a variable from a dataframe, you can set it to NULL
# Delete the column 'name'
df$name <- NULL
print(df)
##   age height austrian
## 1  35   1.72    FALSE
## 2  21   1.65     TRUE
## 3  12   1.39     TRUE
Replacing/Deleting/Adding variables continued
# Adds a completely new variable
df$nationality <- ifelse(df$austrian, "AT", NA)
# Replaces an existing column
df$height <- as.integer(df$height * 100)
# Replace one element
df$age[2] <- 102
# Print resulting data frame
df
##   age height austrian nationality
## 1  35    172    FALSE        <NA>
## 2 102    165     TRUE          AT
## 3  12    139     TRUE          AT
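When several columns have to be removed at once, one option is to keep only the columns that are not on a drop list. The sketch below uses list style subsetting with setdiff(); the names `demo`, `drop_cols` and `age_in_months` are made up for this illustration.

```r
# Illustrative data frame; names and values are assumptions
demo <- data.frame(
  age = c(35L, 21L, 12L),
  height = c(1.72, 1.65, 1.39),
  austrian = c(FALSE, TRUE, TRUE)
)

# Keep every column whose name is not in drop_cols
drop_cols <- c("height", "austrian")
demo <- demo[setdiff(names(demo), drop_cols)]

# Add a new variable derived from an existing one
demo$age_in_months <- demo$age * 12L
names(demo)
```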
Coercion
Some objects in R can be coerced to a data frame. For example, a matrix
can be coerced to a data frame.
mat <- matrix(1:6, nrow = 2,
              dimnames = list(c("Row 1", "Row 2"), LETTERS[1:3]))
df <- as.data.frame(mat)
print(df)
##       A B C
## Row 1 1 3 5
## Row 2 2 4 6
Coercion of Heterogeneous dataframe to matrix
df <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age = c(25, 30, 35),
  married = c(TRUE, FALSE, TRUE)
)
# Coerce to matrix and back to data frame
(df2 <- as.data.frame(as.matrix(df)))
##      name age married
## 1   Alice  25    TRUE
## 2     Bob  30   FALSE
## 3 Charlie  35    TRUE
# Check if the two dataframes are identical
identical(df, df2)
## [1] FALSE
Coercion of Heterogeneous dataframe to matrix continued
In the previous example, df and df2 are not identical. Why?
Let's check the mean of age in df2
mean(df2$age)
## Warning in mean.default(df2$age): argument is not numeric or logical: returning
## NA
## [1] NA
We got NA. Why? Let's check the structure of df2
str(df2)
## 'data.frame': 3 obs. of 3 variables:
## $ name : chr "Alice" "Bob" "Charlie"
## $ age : chr "25" "30" "35"
## $ married: chr "TRUE" "FALSE" "TRUE"
All columns, including age, were coerced to character during the conversion, since a matrix can only hold
one type of data.
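If the original types are needed again, the columns have to be converted back explicitly. A minimal sketch, re-creating the round trip from the example above; stringsAsFactors = FALSE is passed explicitly so the sketch does not depend on the R version.

```r
df <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age = c(25, 30, 35),
  married = c(TRUE, FALSE, TRUE)
)

# Round trip through a matrix turns every column into character
df2 <- as.data.frame(as.matrix(df), stringsAsFactors = FALSE)

# Convert the columns back to their intended types
df2$age     <- as.numeric(df2$age)
df2$married <- as.logical(df2$married)

mean(df2$age)   # now a numeric mean instead of NA
str(df2)
```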
list to dataframe
# y is shorter than x, so it is recycled to length 4
df <- as.data.frame(list(x = c(1, 2, 3, 4), y = c("A", "B")))
print(df)
##   x y
## 1 1 A
## 2 2 B
## 3 3 A
## 4 4 B
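Recycling only works here because the length of y divides the length of x evenly. The sketch below shows what happens otherwise; the error is caught with tryCatch() so that the message can be inspected (the variable name `res` is arbitrary).

```r
# Lengths 3 and 2: 2 does not divide 3, so the list cannot become a data frame
res <- tryCatch(
  as.data.frame(list(x = c(1, 2, 3), y = c("A", "B"))),
  error = function(e) conditionMessage(e)
)
res   # holds the error message instead of a data frame
```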
list to dataframe continued
as.list(df)
## $x
## [1] 1 2 3 4
##
## $y
## [1] "A" "B" "A" "B"
Combining Dataframe
Combining dataframes
The usual cbind and rbind functions can be used to combine dataframes.
df1 <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age = c(25, 30, 35),
  married = c(TRUE, FALSE, TRUE)
)
df2 <- data.frame(
  name = c("David", "Eve"),
  age = c(40, 45),
  married = c(FALSE, TRUE)
)
Combining dataframes continued
# Combine the two dataframes
df <- rbind(df1, df2)
print(df)
##      name age married
## 1   Alice  25    TRUE
## 2     Bob  30   FALSE
## 3 Charlie  35    TRUE
## 4   David  40   FALSE
## 5     Eve  45    TRUE
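For data frames, rbind() matches columns by name rather than by position, and it stops with an error if the names differ. A small sketch with made-up data frames `a`, `b` and `other`:

```r
a <- data.frame(name = "Alice", age = 25)
b <- data.frame(age = 40, name = "David")   # same columns, different order

ab <- rbind(a, b)   # columns of b are matched to those of a by name
ab

# Different column names: rbind() refuses to combine the data frames
other <- data.frame(name = "Eve", years = 45)
msg <- tryCatch(rbind(a, other), error = function(e) conditionMessage(e))
msg
```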
Combining dataframes continued
df1 <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age = c(25, 30, 35),
  married = c(TRUE, FALSE, TRUE)
)
df2 <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  height = c(1.72, 1.65, 1.39),
  austrian = c(FALSE, TRUE, TRUE)
)
# Combine the two dataframes
df <- cbind(df1, df2)
print(df)
##      name age married    name height austrian
## 1   Alice  25    TRUE   Alice   1.72    FALSE
## 2     Bob  30   FALSE     Bob   1.65     TRUE
## 3 Charlie  35    TRUE Charlie   1.39     TRUE
Combining dataframes continued
We can also use the merge function to combine dataframes
df1 <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age = c(25, 30, 35),
  married = c(TRUE, FALSE, TRUE)
)
df2 <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  height = c(1.72, 1.65, 1.39),
  austrian = c(FALSE, TRUE, TRUE)
)
# Combine the two dataframes
df <- merge(df1, df2, by = "name")
print(df)
##      name age married height austrian
## 1   Alice  25    TRUE   1.72    FALSE
## 2     Bob  30   FALSE   1.65     TRUE
## 3 Charlie  35    TRUE   1.39     TRUE
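In the example above every name appears in both data frames. When the keys only partly overlap, merge() by default keeps just the matching rows; the `all` argument keeps the rest and fills the gaps with NA. A hedged sketch with made-up data frames `left` and `right`:

```r
left  <- data.frame(name = c("Alice", "Bob"), age = c(25, 30))
right <- data.frame(name = c("Bob", "Charlie"), height = c(1.65, 1.39))

# Default: only rows whose name occurs in both data frames
inner <- merge(left, right, by = "name")
inner

# all = TRUE: keep every name, missing values become NA
outer <- merge(left, right, by = "name", all = TRUE)
outer
```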
Graphical summary
A quick graphical summary of the different (correct and wrong) ways of combining data frames. We have
three small data frames with two observations each. The first two (left) share the same variable
names and contain the geographical location of some cities. The last (right) shares the same values in
one column (name) with the data frame bottom left, but contains different information.
Graphical summary
Row binding: As the two data frames on the left have the same number of variables (columns)
with the same names, we can use rbind(df1, df2) to combine them. Warning: base R does not
check whether the contents belong together; it just binds them together!
Graphical summary
Column binding: When we have two objects with the same number of rows, we can call cbind(df2, df3).
Again, cbind() does not care about what is in there; it just combines them.
Graphical summary
Merging: merge(df2, df3, by = "name") combines the information correctly. It compares
the values in x$name and y$name and matches the rows accordingly.
Apply Functions
We have already seen the apply function in the previous weeks; let's see some more.
Function : Return value
● lapply : list
● sapply : tries to simplify the result to a vector or matrix
● vapply : similar to sapply, but allows you to specify the type of the return value
Apply functions usage
● lapply(X, FUN, …)
● sapply(X, FUN, …)
● vapply(X, FUN, FUN.VALUE, …)
Let's create a dataframe for the demo
(df <- data.frame(
  name = c("Petra", "Jochen", "Alexander"), # character
  age = c(35L, 21L, 12L),                   # integer
  height = c(1.72, 1.65, 1.39),             # numeric
  austrian = c(FALSE, TRUE, TRUE),          # logical
  stringsAsFactors = FALSE                  # default since R 4.0.0
))
##        name age height austrian
## 1     Petra  35   1.72    FALSE
## 2    Jochen  21   1.65     TRUE
## 3 Alexander  12   1.39     TRUE
lapply
lapply(df, class)
## $name
## [1] "character"
##
## $age
## [1] "integer"
##
## $height
## [1] "numeric"
##
## $austrian
## [1] "logical"
sapply
sapply(df, class)
##        name         age      height    austrian
## "character"   "integer"   "numeric"   "logical"
sapply continued
sapply(df, length)
##     name      age   height austrian
##        3        3        3        3
sapply continued
sapply(df, mean)
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
##       name        age     height   austrian
##         NA 22.6666667  1.5866667  0.6666667
sapply continued
sapply(df, function(x) if(is.numeric(x)) mean(x) else x)
## $name
## [1] "Petra" "Jochen" "Alexander"
##
## $age
## [1] 22.66667
##
## $height
## [1] 1.586667
##
## $austrian
## [1] FALSE TRUE TRUE
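Because the function above returns values of different lengths, sapply() cannot simplify the result and falls back to a list. If only the means of the numeric columns are wanted, one option is to select those columns first. A minimal sketch, re-creating the demo data under the hypothetical name `demo`:

```r
demo <- data.frame(
  name = c("Petra", "Jochen", "Alexander"),
  age = c(35L, 21L, 12L),
  height = c(1.72, 1.65, 1.39),
  austrian = c(FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Logical vector marking the numeric columns (integer and double)
num_cols <- vapply(demo, is.numeric, logical(1))

# Every result now has length 1, so sapply() returns a named numeric vector
col_means <- sapply(demo[num_cols], mean)
col_means
```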
vapply
Return must be character
vapply(df, class, "")
##        name         age      height    austrian
## "character"   "integer"   "numeric"   "logical"
vapply continued
Return must be integer
vapply(df, length, vector("integer", 1))
##     name      age   height austrian
##        3        3        3        3
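The benefit of vapply() shows when the function does not return the declared type: instead of silently producing a different structure, it stops with an error. The sketch below catches that error with tryCatch(); the data frame `demo` is illustrative.

```r
demo <- data.frame(
  name = c("Petra", "Jochen"),
  age = c(35L, 21L),
  stringsAsFactors = FALSE
)

# Declared type matches: class() returns one character string per column
ok <- vapply(demo, class, character(1))
ok

# Declared type does not match: vapply() raises an error
msg <- tryCatch(
  vapply(demo, class, numeric(1)),
  error = function(e) conditionMessage(e)
)
msg
```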
Thank you