Module 8

The document is a course outline for DA112: Introduction to R, focusing on dataframes, which are essential data structures in R. It covers topics such as creating dataframes, their attributes, subsetting, and manipulating data within them. Additionally, it discusses combining dataframes using functions like rbind, cbind, and merge.


Bachelor of Science (Honours) in Data Science and Artificial Intelligence

DA112: Introduction to R
Course Instructor: Ayon Borthakur
TAs: Rahul Goswami, Tenzin Dawa, Kumar Sanu

Optional Note: Use of this material for teaching and education is considered fair use. However, posting
images from it to a website, or any commercial use, could be considered copyright infringement. In such cases,
you are required to reach out to the author(s) or the Institute for permission before use.
Week 8
Dataframes
Learning Objectives

01 What are Dataframes?

02 How to create Dataframes

03 Dataframe Attributes

04 Replacing/Deleting/Coercing in Dataframes

05 Applying functions over Dataframes

Introduction

Introduction to Dataframes

● Dataframes are among the most commonly used data structures in R
● Dataframes are two-dimensional, table-like structures with rows and columns
● Each column can store a different type of data, but all values within one column share the same type
● Internally, dataframes are lists of vectors of equal length
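
The last point can be checked directly in R. The following is a minimal sketch; the dataframe and its column names (id, label) are made up for illustration only.

```r
# A small dataframe with two columns of different types
df <- data.frame(id = 1:3, label = c("a", "b", "c"))

is.list(df)         # TRUE: a dataframe is a list under the hood
typeof(df)          # "list"
lengths(df)         # every column (list element) has the same length
sapply(df, typeof)  # each column keeps its own type
```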

Creating Dataframes


● Dataframes can be created using the data.frame() function


# Create a dataframe
df <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age = c(25, 30, 35),
  married = c(TRUE, FALSE, TRUE)
)

# Print the dataframe
print(df)
##      name age married
## 1   Alice  25    TRUE
## 2     Bob  30   FALSE
## 3 Charlie  35    TRUE
Note: Since R 4.0.0, stringsAsFactors is FALSE by default. This means that character vectors are no longer
automatically converted to factors when creating a data frame. If you want a character vector to be stored
as a factor, you need to convert it explicitly using the factor() function.
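
A minimal sketch of the explicit conversion, assuming R 4.0.0 or later so that the default applies. The city column and its values are made up for illustration.

```r
# Character columns stay character by default (assumes R >= 4.0.0)
df <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  city = c("Graz", "Wien", "Graz")   # hypothetical example values
)
class(df$city)

# Explicit conversion of one column to a factor
df$city <- factor(df$city)
class(df$city)   # "factor"
levels(df$city)  # "Graz" "Wien"
```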

Column names

The columns of a dataframe are always named. What happens if we try to create a dataframe without
column names?
# Create a dataframe without column names
df <- data.frame(
  c("Alice", "Bob", "Charlie"),
  c(25, 30, 35),
  c(TRUE, FALSE, TRUE)
)

# Print the dataframe
print(df)
##   c..Alice....Bob....Charlie.. c.25..30..35. c.TRUE..FALSE..TRUE.
## 1                        Alice            25                 TRUE
## 2                          Bob            30                FALSE
## 3                      Charlie            35                 TRUE
Note: If you create a dataframe without column names, R automatically generates them from the expressions used to create each column, which is why the names above are hard to read.

Dataframe over a matrix

● Dataframes are similar to matrices, but with some important differences
● A matrix can hold only one type of data, whereas a dataframe can store a different type in each column
(df <- data.frame(
  name = c("Petra", "Jochen", "Alexander"), # character
  age = c(35L, 21L, 12L),                   # integer
  height = c(1.72, 1.65, 1.39),             # numeric
  austrian = c(FALSE, TRUE, TRUE),          # logical
  stringsAsFactors = FALSE                  # default since R 4.0.0
))
##        name age height austrian
## 1     Petra  35   1.72    FALSE
## 2    Jochen  21   1.65     TRUE
## 3 Alexander  12   1.39     TRUE

Let's take a look at the structure

str(df)
## 'data.frame': 3 obs. of 4 variables:
## $ name : chr "Petra" "Jochen" "Alexander"
## $ age : int 35 21 12
## $ height : num 1.72 1.65 1.39
## $ austrian: logi FALSE TRUE TRUE

Dataframe Attributes


A dataframe has several attributes and properties that can be queried or set. Some of the
important ones are:

● row.names : The row names of the dataframe
● colnames : The column names of the dataframe
● names : The column names of the dataframe (same as colnames for a dataframe)
● class : The class of the dataframe
● dim : The dimensions of the dataframe (rows, columns)
● nrow : The number of rows in the dataframe
● ncol : The number of columns in the dataframe
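
To see which of these are stored on the object itself, one can inspect attributes(). The following is a small sketch with a reduced two-column version of the demo dataframe. In base R only names, class and row.names are stored as attributes, while dim(), nrow() and ncol() are computed on request.

```r
df <- data.frame(
  name = c("Petra", "Jochen", "Alexander"),
  age = c(35L, 21L, 12L)
)

# The attributes stored on a dataframe: names, class and row.names
names(attributes(df))

# dim() is derived from the number of rows and columns
dim(df)  # 3 2
```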

row.names and colnames

# Setting row names
row.names(df) <- c("Row 1", "Row 2", "Row 3")

# Setting column names
colnames(df) <- c("Name", "Age", "Height", "Austrian")
df
##            Name Age Height Austrian
## Row 1     Petra  35   1.72    FALSE
## Row 2    Jochen  21   1.65     TRUE
## Row 3 Alexander  12   1.39     TRUE

names and class

# Getting the column names


names(df)
## [1] "Name" "Age" "Height" "Austrian"
# Getting the class of the dataframe
class(df)
## [1] "data.frame"

dim, nrow and ncol

# Getting the dimensions of the dataframe


dim(df)
## [1] 3 4
# Getting the number of rows in the dataframe
nrow(df)
## [1] 3
# Getting the number of columns in the dataframe
ncol(df)
## [1] 4

Subsetting Dataframes


Subsetting is one of the vital operations in data manipulation, and since
dataframes are the most used data structure, we must understand the different
criteria under which we can subset a dataframe. There are various ways to
subset a dataframe in R.

List style subsetting

# List style subsetting
df[1] # returns a dataframe containing only the first column
##        name
## 1     Petra
## 2    Jochen
## 3 Alexander

List style subsetting continued

# List style subsetting
df['name'] # returns a dataframe containing only the column named 'name'
##        name
## 1     Petra
## 2    Jochen
## 3 Alexander

List style subsetting continued

# List style subsetting
df[c('name', 'age')] # returns a dataframe with the columns 'name' and 'age'
##        name age
## 1     Petra  35
## 2    Jochen  21
## 3 Alexander  12

List style subsetting continued

# List style subsetting


df$name # returns a vector with the column named 'name'
## [1] "Petra" "Jochen" "Alexander"
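
The $ operator needs the column name written out literally. When the name is stored in a variable, double brackets do the same job. A short sketch contrasting the two bracket forms follows; the variable name col is only for illustration.

```r
df <- data.frame(
  name = c("Petra", "Jochen", "Alexander"),
  age = c(35L, 21L, 12L)
)

col <- "age"   # column name stored in a variable

df[[col]]      # vector, same as df$age
df[col]        # dataframe with a single column

is.data.frame(df[col])    # TRUE
is.data.frame(df[[col]])  # FALSE
```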

List style subsetting continued

# List style subsetting by logical vector
# The logical vector is shorter than the number of columns (4), so it is
# recycled to c(TRUE, FALSE, TRUE, TRUE): columns 1, 3 and 4 are returned
df[c(TRUE, FALSE, TRUE)]
##        name height austrian
## 1     Petra   1.72    FALSE
## 2    Jochen   1.65     TRUE
## 3 Alexander   1.39     TRUE

Matrix style subsetting

# Matrix style subsetting
df[1, ] # returns a dataframe with the first row
##    name age height austrian
## 1 Petra  35   1.72    FALSE

Matrix style subsetting continued

# Matrix style subsetting
# returns a dataframe with the second and third row and the first three columns
df[2:3, 1:3]
##        name age height
## 2    Jochen  21   1.65
## 3 Alexander  12   1.39
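
One detail of matrix style subsetting is worth trying out: selecting a single column simplifies the result to a vector unless drop = FALSE is given. A minimal sketch, using a reduced version of the demo dataframe:

```r
df <- data.frame(
  name = c("Petra", "Jochen", "Alexander"),
  age = c(35L, 21L, 12L)
)

df[, "age"]                # single column: simplified to a vector
df[, "age", drop = FALSE]  # single column: stays a dataframe
df[1, ]                    # single row: stays a dataframe
```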

The subset function

The subset() function is a convenient way to select rows and columns of a dataframe in a single call.

Usage : subset(x, subset, select, drop = FALSE, …)
● x : dataframe (or an object coercible to a dataframe)
● subset : logical expression indicating elements or rows to keep; missing
values are taken as FALSE
● select : expression indicating columns to select from x
● drop : passed on to the ‘[’ indexing operator

The subset function continued

A demo dataframe to demonstrate the subset function

(df <- data.frame(
  name = c("Petra", "Jochen", "Alexander"), # character
  age = c(35L, 21L, 12L),                   # integer
  height = c(1.72, 1.65, 1.39),             # numeric
  austrian = c(FALSE, TRUE, TRUE),          # logical
  stringsAsFactors = FALSE                  # default since R 4.0.0
))
##        name age height austrian
## 1     Petra  35   1.72    FALSE
## 2    Jochen  21   1.65     TRUE
## 3 Alexander  12   1.39     TRUE
Select the rows with age greater than 18 and the columns name and height:
subset(df, age > 18, select = c(name, height))
##     name height
## 1  Petra   1.72
## 2 Jochen   1.65
Students (try it out): The same result can also be obtained using the df[] notation.
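
One possible solution to the exercise, as a sketch rather than the only way to do it. The names res_subset and res_bracket are chosen for illustration.

```r
df <- data.frame(
  name = c("Petra", "Jochen", "Alexander"),
  age = c(35L, 21L, 12L),
  height = c(1.72, 1.65, 1.39),
  austrian = c(FALSE, TRUE, TRUE)
)

# Rows are chosen with a logical vector, columns by name
res_subset  <- subset(df, age > 18, select = c(name, height))
res_bracket <- df[df$age > 18, c("name", "height")]

# Compare the two results
all.equal(res_subset, res_bracket)
```

The two forms differ when the condition contains missing values: subset() treats NA as FALSE and drops those rows, whereas the bracket form returns a row of NAs for each missing value in the condition.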

The subset function continued

Select only the people who are not Austrian:

subset(df, !austrian)
##    name age height austrian
## 1 Petra  35   1.72    FALSE

The subset function continued

subset() returns a data frame if the first argument x is of class
data frame, except if we select a single column and set drop = TRUE. In this case we
only get a vector; in the example below, a logical vector.

subset(df, age >= 18, austrian, drop = TRUE)
## [1] FALSE TRUE

Graphical summary

Replacing/Deleting/Adding Variables

To delete a variable from a dataframe, you can set it to NULL


# Delete the column 'name'
df$name <- NULL
print(df)
##   age height austrian
## 1  35   1.72    FALSE
## 2  21   1.65     TRUE
## 3  12   1.39     TRUE

Replacing/Deleting/Adding variables continued

# Adds a completely new variable
df$nationality <- ifelse(df$austrian, "AT", NA)
# Replaces an existing column
df$height <- as.integer(df$height * 100)
# Replaces one element
df$age[2] <- 102
# Print resulting data frame
df
##   age height austrian nationality
## 1  35    172    FALSE        <NA>
## 2 102    165     TRUE          AT
## 3  12    139     TRUE          AT

Coercion

Some objects in R can be coerced to a data frame. For example, a matrix
can be coerced to a data frame.

mat <- matrix(1:6, nrow = 2, dimnames = list(c("Row 1", "Row 2"), LETTERS[1:3]))
df <- as.data.frame(mat)
print(df)
##       A B C
## Row 1 1 3 5
## Row 2 2 4 6

Coercion of Heterogeneous dataframe to matrix

df <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age = c(25, 30, 35),
  married = c(TRUE, FALSE, TRUE)
)

# Coerce to matrix and back to data frame
(df2 <- as.data.frame(as.matrix(df)))
##      name age married
## 1   Alice  25    TRUE
## 2     Bob  30   FALSE
## 3 Charlie  35    TRUE
# Check if the two dataframes are identical
identical(df, df2)
## [1] FALSE

Coercion of Heterogeneous dataframe to matrix continued

In the previous example, df and df2 are not identical. Why?
Let's check the mean of age in df2:
mean(df2$age)
## Warning in mean.default(df2$age): argument is not numeric or logical:
## returning NA
## [1] NA
We got NA. Why? Let's check the structure of df2:
str(df2)
## 'data.frame': 3 obs. of 3 variables:
## $ name : chr "Alice" "Bob" "Charlie"
## $ age : chr "25" "30" "35"
## $ married: chr "TRUE" "FALSE" "TRUE"
All columns, including age, were coerced to character during the conversion, because a matrix can hold only
one type of data.
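
If the round trip through a matrix has already happened, the original types can be restored by converting the affected columns explicitly. A minimal sketch, assuming R 4.0.0 or later (so that the columns of df2 are character rather than factor):

```r
df <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age = c(25, 30, 35),
  married = c(TRUE, FALSE, TRUE)
)
df2 <- as.data.frame(as.matrix(df))  # every column is now character

# Convert the affected columns back explicitly
df2$age <- as.numeric(df2$age)
df2$married <- as.logical(df2$married)

mean(df2$age)  # 30
str(df2)
```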

List to dataframe

# y has only two elements, so it is recycled to match the length of x
df <- as.data.frame(list(x = c(1, 2, 3, 4), y = c("A", "B")))
print(df)
##   x y
## 1 1 A
## 2 2 B
## 3 3 A
## 4 4 B
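
Recycling only works when the longer length is a whole multiple of the shorter one. The following sketch shows both cases; the error is caught with tryCatch() so that the script keeps running.

```r
# Lengths 4 and 2: the shorter column is recycled twice
ok <- data.frame(x = 1:4, y = c("A", "B"))
nrow(ok)  # 4

# Lengths 4 and 3: 4 is not a multiple of 3, so data.frame() stops with an error
res <- tryCatch(
  data.frame(x = 1:4, y = c("A", "B", "C")),
  error = function(e) conditionMessage(e)
)
res  # the error message as a character string
```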

List to dataframe continued

as.list(df)
## $x
## [1] 1 2 3 4
##
## $y
## [1] "A" "B" "A" "B"

Combining Dataframes

The usual cbind() and rbind() functions can be used to combine dataframes.
df1 <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age = c(25, 30, 35),
  married = c(TRUE, FALSE, TRUE)
)

df2 <- data.frame(
  name = c("David", "Eve"),
  age = c(40, 45),
  married = c(FALSE, TRUE)
)

Combining dataframes continued

# Combine the two dataframes row-wise
df <- rbind(df1, df2)
print(df)
##      name age married
## 1   Alice  25    TRUE
## 2     Bob  30   FALSE
## 3 Charlie  35    TRUE
## 4   David  40   FALSE
## 5     Eve  45    TRUE
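
For dataframes, base R's rbind() matches columns by name, not by position. A small sketch with the columns of the second dataframe deliberately given in a different order:

```r
df1 <- data.frame(name = c("Alice", "Bob"), age = c(25, 30))
# Same columns as df1, but in a different order
df2 <- data.frame(age = 40, name = "David")

res <- rbind(df1, df2)
res
names(res)  # "name" "age": the column order of the first dataframe is kept
```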

Combining dataframes continued

df1 <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age = c(25, 30, 35),
  married = c(TRUE, FALSE, TRUE)
)

df2 <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  height = c(1.72, 1.65, 1.39),
  austrian = c(FALSE, TRUE, TRUE)
)

# Combine the two dataframes column-wise
df <- cbind(df1, df2)
print(df)
##      name age married    name height austrian
## 1   Alice  25    TRUE   Alice   1.72    FALSE
## 2     Bob  30   FALSE     Bob   1.65     TRUE
## 3 Charlie  35    TRUE Charlie   1.39     TRUE

Combining dataframes continued

We can also use the merge() function to combine dataframes by matching the values of a common column.

df1 <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age = c(25, 30, 35),
  married = c(TRUE, FALSE, TRUE)
)

df2 <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  height = c(1.72, 1.65, 1.39),
  austrian = c(FALSE, TRUE, TRUE)
)

# Combine the two dataframes by the common column 'name'
df <- merge(df1, df2, by = "name")
print(df)
##      name age married height austrian
## 1   Alice  25    TRUE   1.72    FALSE
## 2     Bob  30   FALSE   1.65     TRUE
## 3 Charlie  35    TRUE   1.39     TRUE
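
In the example above every name occurs in both dataframes. When the keys only partly overlap, the all, all.x and all.y arguments of merge() control which rows are kept. A sketch with made-up data:

```r
df1 <- data.frame(name = c("Alice", "Bob", "Charlie"), age = c(25, 30, 35))
df2 <- data.frame(name = c("Bob", "Charlie", "David"), height = c(1.65, 1.39, 1.80))

inner <- merge(df1, df2, by = "name")                # only names present in both
full  <- merge(df1, df2, by = "name", all = TRUE)    # all names, gaps filled with NA
left  <- merge(df1, df2, by = "name", all.x = TRUE)  # all rows of df1

nrow(inner)  # 2
nrow(full)   # 4
nrow(left)   # 3
```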

Graphical summary

A quick graphical summary of the different (correct and wrong) ways of combining data frames. We have three small data frames with two observations each. The first two (left) share the same variable names and contain the geographical location of some cities. The last (right) shares the same values in one column (name) with the data frame bottom left, but contains different information.

Graphical summary

Row binding: As the two data frames on the left have the same number of variables (columns) we can use rbind(df1, df2) to combine them. Warning: base R does not [...] binds them together!

Graphical summary

Column binding: When we have two objects with the same number of rows, we can call cbind(df2, df3). Again, cbind() does not care about what is in there, it just combines them.
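
The following sketch shows why that matters. The two dataframes and their values are hypothetical placeholders (not the data from the figure) with the same names in a different row order.

```r
# Hypothetical data: same names, different row order
left  <- data.frame(name = c("city_a", "city_b"), x = c(1, 2))
right <- data.frame(name = c("city_b", "city_a"), y = c(20, 10))

bound  <- cbind(left, right)               # pairs rows by position
merged <- merge(left, right, by = "name")  # pairs rows by the value of name

bound   # row 1 combines city_a with the y value of city_b
merged  # each name is combined with its own x and y
```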

Graphical summary

Merging: merge(df2, df3, by = "name") merges the information correctly. It compares the values in x$name and y$name and combines the matching rows.

Apply Functions

We have already seen the apply function in the previous weeks; let's see some more.

Function   Return value
lapply     list
sapply     tries to simplify the result to a vector or matrix
vapply     similar to sapply, but allows you to specify the type of the return value

Apply functions usage

● lapply(X, FUN, …)
● sapply(X, FUN, …)
● vapply(X, FUN, FUN.VALUE, …)

Let's create a dataframe for the demo

(df <- data.frame(
  name = c("Petra", "Jochen", "Alexander"), # character
  age = c(35L, 21L, 12L),                   # integer
  height = c(1.72, 1.65, 1.39),             # numeric
  austrian = c(FALSE, TRUE, TRUE),          # logical
  stringsAsFactors = FALSE                  # default since R 4.0.0
))
##        name age height austrian
## 1     Petra  35   1.72    FALSE
## 2    Jochen  21   1.65     TRUE
## 3 Alexander  12   1.39     TRUE

lapply

lapply(df, class)
## $name
## [1] "character"
##
## $age
## [1] "integer"
##
## $height
## [1] "numeric"
##
## $austrian
## [1] "logical"

sapply

sapply(df, class)
##        name         age      height    austrian
## "character"   "integer"   "numeric"   "logical"

sapply continued

sapply(df, length)
##     name      age   height austrian
##        3        3        3        3

sapply continued

sapply(df, mean)
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
##       name        age     height   austrian
##         NA 22.6666667  1.5866667  0.6666667

sapply continued

sapply(df, function(x) if(is.numeric(x)) mean(x) else x)


## $name
## [1] "Petra" "Jochen" "Alexander"
##
## $age
## [1] 22.66667
##
## $height
## [1] 1.586667
##
## $austrian
## [1] FALSE TRUE TRUE
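
The result above is a list because the individual results have different lengths, so sapply() cannot simplify them to a vector. If only the means are of interest, one option is to select the numeric columns first. A sketch, with num_cols and means as illustrative names:

```r
df <- data.frame(
  name = c("Petra", "Jochen", "Alexander"),
  age = c(35L, 21L, 12L),
  height = c(1.72, 1.65, 1.39),
  austrian = c(FALSE, TRUE, TRUE)
)

# Logical vector with one entry per column: TRUE for age and height
num_cols <- sapply(df, is.numeric)
num_cols

# Every result now has length one, so sapply() returns a named numeric vector
means <- sapply(df[num_cols], mean)
means
```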

vapply

Return must be character

vapply(df, class, "") # "" is the template: one character string per column
##        name         age      height    austrian
## "character"   "integer"   "numeric"   "logical"

vapply continued

Return must be integer

vapply(df, length, vector("integer", 1))
##     name      age   height austrian
##        3        3        3        3
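
The benefit of vapply() shows when the function returns something other than what was promised: instead of silently changing the result type, it stops with an error. A minimal sketch, catching the error with tryCatch(); the replacement message is made up for illustration.

```r
df <- data.frame(
  name = c("Petra", "Jochen", "Alexander"),
  age = c(35L, 21L, 12L)
)

# Works: every result is a single character string
vapply(df, class, character(1))

# Fails: class() returns a character string, but numeric(1) was promised
res <- tryCatch(
  vapply(df, class, numeric(1)),
  error = function(e) "vapply stopped with an error"
)
res
```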

Thank you
