Module - 3

Data Manipulation

Prepared by : Dr. Srinivasa Rao Pokuri, Faculty SCOPE, VIT AP

Data Sorting
• R provides several ways to sort data in ascending or descending order.
• Data analysts and data scientists use order(), sort(), and packages such as dplyr to sort data,
depending on the structure of the data.
• The function order() can be used to sort a vector, a matrix, or a data frame in
ascending or descending order.
• Syntax of order():
• order(x, decreasing = FALSE, na.last = TRUE, method = c("auto", "shell", "radix"))
• The argument x is the vector to sort by; for a data frame, pass the column
(or columns) to sort by.
• The argument decreasing is a logical value that determines
whether the sorting should be in decreasing order (TRUE) or increasing
order (FALSE, the default).
• The argument na.last is also a logical value that determines
whether missing values should be placed at the end (TRUE, the default)
or at the beginning (FALSE); na.last = NA drops them.
• The argument method is optional and specifies the
sorting algorithm to be used.
• The available options for order() are "auto" (default), "shell", and "radix";
"quick" is accepted by sort(), not by order().
• The "auto" option automatically selects the algorithm based on the
size and type of the input data.
• Overall, order() returns the index positions that put its input in sorted
order, with options for the sorting direction, the handling of
missing values, and the sorting algorithm.
• Example:
• y = c(4,12,6,7,2,9,5)
• order(y)
• In this case, the output would be: 5 1 7 3 4 6 2, which
indicates that the smallest value is at index 5, the next
smallest is at index 1, and so on.
• Note that the original vector "y" is not modified by this
function.
• Example:
• y = c(4,12,6,7,2,9,5)
• y[order(y)]
• The "order" function returns the indices of the sorted values,
so when we use these indices to subset "y" with square
brackets, we get the sorted values of "y".
• Therefore, the output of this code will be: 2 4 5 6 7 9 12.
• Example:
• x <- c(8,2,4,1,-4,NA,46,8,9,5,3)
• order(x,na.last = TRUE)
• The "na.last = TRUE" argument specifies that missing values
(NA) should be placed at the end of the sorted vector.
• So, the output of this code would be: 5 4 2 11 3 10 1 8 9 7 6.
• Example:
• x <- c(8,2,4,1,-4,NA,46,8,9,5,3)
• order(x,na.last=FALSE)
• The "na.last = FALSE" argument specifies that missing values (NA)
should be placed at the beginning of the result.
• Therefore, the output of this code will be: 6 5 4 2 11 3 10 1 8 9 7.
• Example:
• x <- c(8,2,4,1,-4,NA,46,8,9,5,3)
• order(x,decreasing=TRUE,na.last=TRUE)
• The decreasing=TRUE argument specifies that the values should
be ordered from largest to smallest instead of smallest to largest.
• The na.last=TRUE argument specifies that missing values should be
placed at the end of the result.
• Overall, this code returns the indices that would sort x in
descending order.
• So, the output of this code would be: 7 9 1 8 10 3 11 2 4 5 6.
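The examples above all go through order(). As a rough sketch of how sort() relates to it, the snippet below reuses the same vector x; the variable names are illustrative only. One point worth noting is that sort() drops NA by default, whereas order() keeps it and places it last.

```r
x <- c(8, 2, 4, 1, -4, NA, 46, 8, 9, 5, 3)

# order() returns index positions; subsetting with them gives the sorted values.
idx <- order(x)                  # NA is placed last by default
sorted_with_na <- x[idx]         # 11 values, NA at the end

# sort() returns the sorted values directly and drops NA by default.
sorted_no_na <- sort(x)          # 10 values, no NA

# Keep the NA and put it first instead.
sorted_na_first <- sort(x, na.last = FALSE)

# Descending order without building the index by hand.
sorted_desc <- sort(x, decreasing = TRUE)

print(sorted_with_na)
print(sorted_no_na)
print(sorted_na_first)
print(sorted_desc)
```

Use order() when the index positions are needed (for example, to reorder the rows of a data frame) and sort() when only the sorted values matter.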
Sorting a data frame by using order()
• In R, a data frame is a two-dimensional tabular data structure that
consists of rows and columns.
• Sorting a data frame allows us to reorder the rows based on the
values in one or more columns.
• This can be useful for various purposes, such as organizing data for
analysis or presentation.
• Methods to sort a data frame:
• order() function (increasing and decreasing order)
• arrange() function from the dplyr package
• setorder() function from the data.table package

• Using order() function:
• order() is used to sort the data frame based on a particular column of
the data frame.
• Syntax: order(dataframe$column_name, decreasing = TRUE)
• where
• dataframe is the input data frame
• column_name is the column by which the data frame is
sorted
• decreasing specifies the sorting order: if it is TRUE, the data frame is
sorted in descending order; otherwise, in increasing order.
• Return type: index positions of the elements; the rows are reordered with dataframe[order(...), ].

• Example 1: R program that creates a data frame with 2 columns and
displays it sorted in decreasing order, first by subjects and
then by rollno.
# create data frame with rollno and subjects columns
data = data.frame(
rollno = c(1, 5, 4, 2, 3),
subjects = c("java", "python", "php", "sql", "c"))
print(data)
print("sort the data in decreasing order based on subjects")
print(data[order(data$subjects, decreasing = TRUE), ])
print("sort the data in decreasing order based on rollno")
print(data[order(data$rollno, decreasing = TRUE), ])

• Example 2: R program that creates a data frame
with 3 columns named rollno, names, and subjects, and
displays it sorted in increasing order by subjects,
by rollno, and by names.

# create dataframe with roll no, names and subjects columns
data=data.frame(rollno = c(1, 5, 4, 2, 3),
names = c("sravan", "bobby","pinkey", "rohith","ganesh"),
subjects = c("java", "python","php", "sql", "c"))

print(data)

print("sort the data in increasing order based on subjects")

print(data[order(data$subjects, decreasing = FALSE), ] )

print("sort the data in increasing order based on rollno")

print(data[order(data$rollno, decreasing = FALSE), ] )

print("sort the data in increasing order based on names")

print(data[order(data$names,decreasing = FALSE), ] )
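The examples above sort by a single column at a time. A minimal sketch of sorting by more than one column with order() could look like the following; the data is a hypothetical variant of the example in which two subjects repeat, so that ties actually occur.

```r
# Hypothetical data: some subjects repeat, so the second sort key matters.
data <- data.frame(
  rollno   = c(1, 5, 4, 2, 3),
  names    = c("sravan", "bobby", "pinkey", "rohith", "ganesh"),
  subjects = c("java", "python", "java", "sql", "python")
)

# Sort by subjects (ascending); ties are broken by rollno (ascending).
by_subject_then_roll <- data[order(data$subjects, data$rollno), ]
print(by_subject_then_roll)

# Mixed directions: subjects ascending, rollno descending.
# A per-column 'decreasing' vector requires method = "radix".
mixed <- data[order(data$subjects, data$rollno,
                    decreasing = c(FALSE, TRUE), method = "radix"), ]
print(mixed)
```

The arrange() function from dplyr and the setorder() function from data.table also accept several sort columns, so the same result can be obtained with those packages if they are installed.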
Identify and Remove Duplicate
Data in R
• Storage administrators are struggling to handle spiraling volumes of
documents, audio, video, images and large email attachments
• Adding storage is not always the best solution
• Many companies are turning to data reduction technologies such as data
deduplication.
• Data Duplication:
• Entries that have been added by a system user multiple times,
for example, re-registering because you have forgotten your details. It is one of the
problems that cause inconsistency in databases.
• Data Redundancy:
• The same data is stored at multiple locations or tables.
• Data redundancy is costly to address, as it requires:
• additional storage
• synchronization between databases
• design work to align the information represented by different presentations of ...
• A dataset can have duplicate values and to keep it redundancy-free and
accurate, duplicate rows need to be identified and removed.
• In this module, we will see how to identify and remove duplicate data in R.
First we check whether duplicate data is present; if it is, we
remove it.

• Identifying Duplicate Data in a vector:
• We can use the duplicated() function to find out which values of a vector
are duplicates, and sum() to count them.
• Syntax : duplicated(vector_name)
• The R function duplicated() returns a logical vector where TRUE marks
the elements of a vector (or rows of a data frame) that repeat an earlier one.
• Example:
# Create a sample vector with duplicate elements
vector1 <- c(1, 2, 3, 4, 4, 5)
# Identify duplicate elements
duplicated(vector1) # FALSE FALSE FALSE FALSE TRUE FALSE
# count of duplicated data
sum(duplicated(vector1)) # 1
• Removing Duplicate Data in a vector:
• We can remove duplicate data from a vector by using the unique() function,
which returns only the unique values.
• Syntax : unique(vector_name)
• The R function unique() returns the vector with duplicate elements
removed, keeping the first occurrence of each value.
• Example:
# Create a sample vector with duplicate elements
vector1 <- c(1, 2, 3, 4, 4, 5)
# Remove duplicate elements
unique(vector1) # 1 2 3 4 5
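By default duplicated() flags only the later repeats, never the first occurrence. The sketch below, on a slightly extended hypothetical vector, shows the fromLast argument and one way to flag every member of a repeated group; the variable names are illustrative.

```r
vector1 <- c(1, 2, 3, 4, 4, 5, 2)

# Default: the first occurrence is kept, later repeats are flagged.
dup_forward <- duplicated(vector1)

# fromLast = TRUE scans from the end, so the last occurrence is kept instead.
dup_backward <- duplicated(vector1, fromLast = TRUE)

# Flag every member of a repeated group (first and later occurrences).
dup_any <- dup_forward | dup_backward

# The distinct values that occur more than once.
repeated_values <- unique(vector1[dup_forward])

print(dup_forward)
print(dup_backward)
print(dup_any)
print(repeated_values)
```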
• Identifying Duplicate Data in a data frame:
• For identification, we will use the duplicated() function.
• Syntax : duplicated(dataframe)
• Approach:
• Create the data frame
• Pass it to the duplicated() function
• The function returns a logical vector with TRUE for each row that duplicates an earlier row
• Apply the sum() function to get the number of duplicates.
• Removing Duplicate Data in a data frame:
• We use the unique() function from base R or the distinct() function from dplyr.
• Approach:
• Create the data frame
• Select the rows which are unique
• Retrieve those rows
• Identifying Duplicate Data in a data frame:
# Creating a sample data frame of students and their marks in respective subjects.
student = data.frame(name = c("Ram", "Geeta", "John", "Paul", "Cassie", "Geeta", "Paul"),
maths = c(7, 8, 8, 9, 10, 8, 9),
science = c(5, 7, 6, 8, 9, 7, 8),
history = c(7, 7, 7, 7, 7, 7, 7))

# Printing data
student
# TRUE marks rows that repeat an earlier row
duplicated(student)
# Number of duplicated rows
sum(duplicated(student))

• Removing rows with duplicate values in a particular column:
# create data frame
data = data.frame(names = c("manoj", "bobby", "sravan", "deepu", "manoj", "bobby"),
id = c(1, 2, 3, 4, 1, 2),
subject = c("java", "python", "php", "html", "java", "python"))
data
# keep the first row for each distinct subject
print(data[!duplicated(data$subject), ])
# keep the first row for each distinct name
print(data[!duplicated(data$names), ])
# keep the first row for each distinct id
print(data[!duplicated(data$id), ])
• The same example with a repeated id value (deepu also has id 1):
# create data frame
data = data.frame(names = c("manoj", "bobby", "sravan", "deepu", "manoj", "bobby"),
id = c(1, 2, 3, 1, 1, 2),
subject = c("java", "python", "php", "html", "java", "python"))
data
# keep the first row for each distinct subject
print(data[!duplicated(data$subject), ])
# keep the first row for each distinct name
print(data[!duplicated(data$names), ])
# keep the first row for each distinct id (this also drops deepu's row)
print(data[!duplicated(data$id), ])
• Removing Duplicate Data in a data frame:
• Method 1: Using unique()
• We use unique() to get the rows having unique values in our data.
• Syntax: unique(dataframe)
• Example:
# Creating a sample data frame of students and their marks in respective subjects.
student = data.frame(name = c("Ram", "Geeta", "John", "Paul", "Cassie", "Geeta", "Paul"),
maths = c(7, 8, 8, 9, 10, 8, 9),
science = c(5, 7, 6, 8, 9, 7, 8),
history = c(7, 7, 7, 7, 7, 7, 7))

# Printing data
student
# Printing data without duplicates using unique
unique(student)
• Method 2: Using distinct()
• This function is available in the dplyr package and is used to get the unique rows of a
data frame.
• We can remove rows that are duplicated across the entire data frame, and we can also remove
rows that are duplicated in a particular column.
• Syntax: distinct(dataframe, ..., .keep_all = FALSE)
• Where dataframe -> data in use, ... -> optional columns used to decide uniqueness, and
.keep_all -> whether to keep all other columns (TRUE) or only the listed ones (FALSE)
• Example:
library(dplyr) # load library
# Creating a sample data frame of students and their marks in respective subjects.
student = data.frame(name = c("Ram", "Geeta", "John", "Paul", "Cassie", "Geeta", "Paul"),
maths = c(7, 8, 8, 9, 10, 8, 9),
science = c(5, 7, 6, 8, 9, 7, 8),
history = c(7, 7, 7, 7, 7, 7, 7))
# Printing data
student
# Printing data without duplicates using distinct
dplyr::distinct(student)
# load the package
library(dplyr)
# create data frame
data = data.frame(names = c("manoj", "bobby", "sravan", "deepu", "manoj", "bobby"),
id = c(1, 2, 3, 4, 1, 2),
subjects = c("java", "python", "php", "html", "java", "python"))
# remove all duplicate rows
print(dplyr::distinct(data))
# distinct values of the subjects column (only that column is returned)
print(dplyr::distinct(data, subjects))
# distinct values of the names column (only that column is returned)
print(dplyr::distinct(data, names))
Example:
# Creating a sample data frame of students and their marks in respective subjects.
student = data.frame(name = c("Ram", "Geeta", "John", "Paul", "Cassie", "Geeta", "Paul"),
maths = c(7, 8, 8, 9, 10, 8, 9),
science = c(5, 7, 6, 8, 9, 7, 8),
history = c(7, 7, 7, 7, 7, 7, 7))

# Printing data
student
# Keep the first row for each distinct maths mark, with all columns
dplyr::distinct(student, maths, .keep_all = TRUE)
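For readers without dplyr installed, the same kind of column-based de-duplication can be sketched in base R with duplicated(). This is an illustrative equivalent, not part of the original example; the variable names are assumptions.

```r
student <- data.frame(
  name    = c("Ram", "Geeta", "John", "Paul", "Cassie", "Geeta", "Paul"),
  maths   = c(7, 8, 8, 9, 10, 8, 9),
  science = c(5, 7, 6, 8, 9, 7, 8),
  history = c(7, 7, 7, 7, 7, 7, 7)
)

# Keep the first row for each distinct maths mark; all columns are retained.
first_per_maths <- student[!duplicated(student$maths), ]
print(first_per_maths)

# Uniqueness on a combination of columns: pass a data frame of those columns.
first_per_pair <- student[!duplicated(student[, c("maths", "science")]), ]
print(first_per_pair)
```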

Data Cleaning
• Data Cleaning in R is the process of transforming raw data into consistent
data that can be easily analysed.
• It aims to improve the content and reliability of the statistical statements that are
based on the data.
• Moreover, it improves data quality and overall productivity.
• Purpose of Data Cleaning:
• The following are the various purposes of data cleaning in R
• Eliminate Errors
• Eliminate Redundancy
• Increase Data Reliability
• Deliver Accuracy
• Ensure Consistency
• Assure Completeness
• Overview of a typical data analysis chain: (figure not reproduced)

• For better understanding let us discuss with one example
1) Creation of Example Data (Data Frame)
2) Modify Column Names
3) Format Missing Values
4) Remove Empty Rows & Columns
5) Remove Rows with Missing Values
6) Remove Duplicates
7) Modify Classes of Columns
8) Detect & Remove Outliers
9) Remove Spaces in Character Strings

1) Creation of Example Data:
data <- data.frame(x1 = c(1:4, 99999, 1, NA, 1, 1, NA), # Create example data frame
x2 = c(1:5, 1, "NA", 1, 1, "NA"),
x3 = c(letters[c(1:3)], "x x", "x", " y y y", "x", "a", "a", NA),
x4 = "",
x5 = NA)
data # Printing data frame

2) Modify Column Names:
colnames(data) # Current column names
ncol(data) # Number of columns
• The colnames() function returns or sets the names
of the columns in a data frame.
• The ncol() function returns the
number of columns of the object.
• Let’s assume that we want to change these column names to a consecutive range with the
prefix “col”. Then, we can apply the colnames, paste0, and ncol functions as shown below.
colnames(data) <- paste0("col", 1:ncol(data)) # Modify all column names
data # Print updated data frame

3) Format Missing Values:
• In the R programming language, missing values are usually represented by NA. For
that reason, it is useful to convert all missing values to this NA format.
• In our specific example data frame, we have the problem that some missing values
are represented by blank character strings.
• If we want to assign NA values to those blank cells, we can use the following syntax
data[data == ""] <- NA
data
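Blank strings are only one way missing values get encoded; the example data also stores the text "NA" in one column. A minimal sketch for converting several such markers at once is given below. The data frame and the list of markers are hypothetical and would need adjusting to the data at hand.

```r
# Hypothetical example data with several missing-value markers.
df <- data.frame(
  a = c("1", "", "NA", "4"),
  b = c("x", " ", "n/a", NA),
  stringsAsFactors = FALSE
)

# Markers to treat as missing (an assumption; adjust to the data at hand).
missing_codes <- c("", " ", "NA", "n/a")

# Replace the markers column by column; df[] keeps the data frame structure.
df[] <- lapply(df, function(col) {
  col[col %in% missing_codes] <- NA
  col
})

print(df)
print(colSums(is.na(df)))   # missing values per column
```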

4) Remove Empty Rows & Columns:
• The syntax below demonstrates how to use the rowSums, is.na, and ncol functions
to remove only-NA rows.
data <- data[rowSums(is.na(data)) != ncol(data), ] # Drop empty rows
data # Print updated data frame

• Similar to that, we can also exclude columns that contain only NA values.
data <- data[ , colSums(is.na(data)) != nrow(data)] # Drop empty cols
data # Print updated data frame

5) Remove Rows with Missing Values:
• However, in case you have decided to remove all rows with one or more NA values,
you may use the na.omit function as shown below.
data <- na.omit(data) # Drop rows with missing vals
data # Print updated data frame
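na.omit() removes a row if any column is missing. When only certain columns matter, complete.cases() on those columns is one alternative. The sketch below uses a small hypothetical data frame to contrast the two.

```r
# Hypothetical data: 'score' and 'note' are missing in different rows.
df <- data.frame(
  id    = 1:5,
  score = c(10, NA, 30, 40, NA),
  note  = c("a", "b", NA, "d", NA),
  stringsAsFactors = FALSE
)

# na.omit() drops every row that has an NA in any column.
all_complete <- na.omit(df)

# complete.cases() on chosen columns drops rows only when those columns are missing.
score_present <- df[complete.cases(df[, "score", drop = FALSE]), ]

print(all_complete)
print(score_present)
```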

6) Remove Duplicates:
• we can apply the unique function to our data frame to remove duplicates.
data <- unique(data) # delete duplicate rows
data # Print updated data frame

7) Modify Classes of Columns:
• The class of the columns of a data frame is another critical topic when it comes to data
cleaning.
• This example explains how to format each column to the most appropriate data type
automatically.
• Let’s first check the current classes of our data frame columns.
sapply(data, class) # Print classes of all columns
# col1 col2 col3
# "numeric" "character" "character"
• We can now use the type.convert function to change the column classes whenever it is
appropriate
data <- type.convert(data, as.is = TRUE)
sapply(data, class) # Print classes of all columns
# col1 col2 col3
8) Detect & Remove Outliers:
• One method to detect outliers is provided by the boxplot.stats function. The
following R code demonstrates how to test for outliers in our data frame column
col1
# Identify outliers in column
data$col1[data$col1 %in% boxplot.stats(data$col1)$out]
# [1] 99999   (this value is obviously much higher than the other values in this column)
• Let’s assume that we have confirmed theoretically that the observation containing
this outlier should be removed. Then, we can apply the R code below
# Remove rows with outliers
data <- data[! data$col1 %in% boxplot.stats(data$col1)$out, ]
data
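boxplot.stats() flags values that lie more than 1.5 times the box length beyond the box. A rough hand-written version of that rule is sketched below on a hypothetical vector. It uses quantile() for the quartiles, whereas boxplot.stats() uses Tukey's hinges, so the fences can differ slightly on other data.

```r
# Hypothetical numeric column with one extreme value.
x <- c(1, 2, 3, 4, 99999, 1, 1, 1)

# Quartiles and interquartile range.
q1 <- quantile(x, 0.25, names = FALSE)
q3 <- quantile(x, 0.75, names = FALSE)
iqr <- q3 - q1

# Values beyond 1.5 * IQR from the quartiles are treated as outliers.
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

is_outlier <- x < lower_fence | x > upper_fence
outliers <- x[is_outlier]
cleaned <- x[!is_outlier]

print(outliers)
print(cleaned)
```

Whether a flagged value should actually be removed remains a subject-matter decision, as noted above.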
9) Remove Spaces in Character Strings:
• The manipulation of character strings is another important aspect of the data
cleaning process.
• This example demonstrates how to avoid blank spaces in the character strings of a
certain variable.
• For this task, we can use the gsub function as demonstrated below:
# Delete white space in character strings
data$col3 <- gsub(" ", "", data$col3)
data
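Removing every space is not always what is wanted, since it also joins separate words. The sketch below, on a hypothetical character vector, contrasts that with trimming only the ends via trimws() and with collapsing repeated spaces.

```r
# Hypothetical character column with leading, trailing and repeated spaces.
col3 <- c("a", "x  x", " y y y", "b  ")

# Remove every space (the approach shown above).
all_removed <- gsub(" ", "", col3)

# Remove only leading and trailing whitespace.
edges_trimmed <- trimws(col3)

# Trim the ends and collapse runs of whitespace to a single space.
collapsed <- gsub("\\s+", " ", trimws(col3))

print(all_removed)
print(edges_trimmed)
print(collapsed)
```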

Data Recoding
• Recoding allows you to create new variables and to replace existing values of a
variable based on a criterion.
• Example: Let us consider a data frame
df <- data.frame(player = c('P1', 'P2', 'P3', 'P4'),
points = c(124, 229, 313, 415),
result = c('Win', 'Loss', 'Win', 'Loss'))
df

• The easiest way to recode is to use revalue() or mapvalues(), both of which are
defined in the plyr package.
• Example:
df <- data.frame(player = c('P1', 'P2', 'P3', 'P4'),
points = c(124, 229, 313, 415),
result = c('Win', 'Loss', 'Win', 'Loss'))
df
df$scode <- plyr::revalue(df$result, c("Win"="1", "Loss"="2")) # Creating a new variable
df

df$result <- plyr::mapvalues(df$result, from = c("Win","Loss"), to = c("1", "0")) # Modifying the existing variable
df

• It is also possible to recode using ifelse().
• Example:
df <- data.frame(player = c('P1', 'P2', 'P3', 'P4'),
points = c(124, 229, 313, 415),
result = c('Win', 'Loss', 'Win', 'Loss'))
df
df$scode <- ifelse(df$result=="Win",1,0) # Creating a new variable
df

df$result <- ifelse(df$result=="Win",1,0) # Modifying the existing variable
df
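ifelse() handles two categories neatly; with three or more it needs nesting. One base R alternative is a named lookup vector, sketched below. The data and the coding scheme (including the extra "Draw" category) are hypothetical.

```r
# Hypothetical data with a third category.
df <- data.frame(
  player = c("P1", "P2", "P3", "P4", "P5"),
  result = c("Win", "Loss", "Draw", "Win", "Loss"),
  stringsAsFactors = FALSE
)

# Hypothetical coding scheme: names are the old values, elements the new codes.
codes <- c(Win = 2, Draw = 1, Loss = 0)

# Index the lookup vector by the old values; unmatched values would become NA.
df$scode <- unname(codes[df$result])

print(df)
```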

Data Merging
• Merging data is a common task in data analysis, especially when working with large
datasets.
• The merge() function in R is a powerful tool that allows you to combine two
datasets based on shared variables; more than two can be combined by merging repeatedly.
• In R there are various ways to merge data frames: using the ‘merge()’ function
from base R, using the ‘dplyr’ package, and using the ‘data.table’ package.
• Using ‘merge()’ from base R:
• The merge() function in base R helps us to combine two data frames based on
common columns.
• It performs various types of joins such as inner join, left join, right join, and full join.
• Syntax: merged_df <- merge(x, y, by = "common_column", ...)
• ‘x’ and ‘y’ are the data frames that you want to merge.
• ‘by’ specifies the common columns on which the merge will be performed.
• Additional arguments like ‘all.x’, ‘all.y’ and ‘all’ control the type of join that is to be performed.

• Example:
• Consider two data frames ‘df1’ and ‘df2’
df1 <- data.frame(ID = c(1, 2, 3, 4),
Name = c("A", "B", "C", "D"),
Age = c(25, 30, 35, 40))

df2 <- data.frame(ID = c(2, 3, 4, 5),
Occupation = c("Engineer", "Teacher", "Doctor", "Lawyer"),
Salary = c(5000, 4000, 6000, 7000))

1. Inner join (default behavior):
inner_join <- merge(df1, df2, by = "ID")
print(inner_join)

The resulting ‘inner_join’ data frame will only include the rows whose ‘ID’ is present
in both ‘df1’ and ‘df2’ (here, IDs 2, 3 and 4).

2. Left join (‘all.x = TRUE’):
left_join <- merge(df1, df2, by = "ID", all.x = TRUE)
print(left_join)

The resulting ‘left_join’ data frame will include all rows from ‘df1’ and the matching
rows from ‘df2’. Rows of ‘df1’ with no match in ‘df2’ will have ‘NA’ in the
columns that come from ‘df2’.

3. Right join (‘all.y = TRUE’):
right_join <- merge(df1, df2, by = "ID", all.y = TRUE)
print(right_join)

The resulting ‘right_join’ data frame will include all rows from ‘df2’ and the matching
rows from ‘df1’. Rows of ‘df2’ with no match in ‘df1’ will have ‘NA’ in the
columns that come from ‘df1’.

4. Full outer join (‘all = TRUE’):
full_join <- merge(df1, df2, by = "ID", all = TRUE)
print(full_join)

The resulting ‘full_join’ data frame will include all rows from both ‘df1’ and ‘df2’.
Columns with no matching value will contain ‘NA’.

5. Cross join (‘by = NULL’):
cross_join <- merge(df1, df2, by = NULL)
print(cross_join)

A cross join, also known as a Cartesian join, joins
every row of one data frame to
every row of the other data frame.
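The examples above assume the key column has the same name in both data frames. When it does not, merge() accepts ‘by.x’ and ‘by.y’, and ‘suffixes’ labels non-key columns that share a name. The sketch below uses a hypothetical second table, ‘jobs’, to illustrate this.

```r
df1 <- data.frame(ID = c(1, 2, 3, 4),
                  Name = c("A", "B", "C", "D"),
                  Age = c(25, 30, 35, 40))

# Hypothetical second table: the key is called emp_id and it also has an Age column.
jobs <- data.frame(emp_id = c(2, 3, 4, 5),
                   Occupation = c("Engineer", "Teacher", "Doctor", "Lawyer"),
                   Age = c(31, 36, 41, 50))

# Full outer join on differently named keys; clashing column names get suffixes.
merged <- merge(df1, jobs,
                by.x = "ID", by.y = "emp_id",
                all = TRUE,
                suffixes = c("_df1", "_jobs"))

print(merged)
```

The key column keeps the name it has in the first data frame, and the two Age columns appear as Age_df1 and Age_jobs.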

• Using ‘dplyr’ Package:
• ‘dplyr’ provides one function per join type: ‘inner_join()’, ‘left_join()’, ‘right_join()’ and ‘full_join()’.
• Syntax :
merged_df <- inner_join(x, y, by = "common_column") # or left_join(), right_join(), full_join()
• ‘x’ and ‘y’ are the data frames to be merged.
• ‘by’ specifies the common columns on which the merge is to be performed.
• The function name (‘inner’, ‘left’, ‘right’ or ‘full’) selects the type of join.
• A single ‘join()’ function with a ‘type’ argument exists in the ‘plyr’ package, not in ‘dplyr’.
• Example:
• Consider two data frames ‘df1’ and ‘df2’
df1 <- data.frame(ID = c(1, 2, 3, 4),
Name = c("A", "B", "C", "D"),
Age = c(25, 30, 35, 40))

df2 <- data.frame(ID = c(2, 3, 4, 5),
Occupation = c("Engineer", "Teacher", "Doctor", "Lawyer"),
Salary = c(5000, 4000, 6000, 7000))

• Inner join:
inner_join <- dplyr::inner_join(df1, df2, by = "ID")
print(inner_join)

• Left join:
left_join <- dplyr::left_join(df1, df2, by = "ID")
print(left_join)

• Right join:
right_join <- dplyr::right_join(df1, df2, by = "ID")
print(right_join)

• Full outer join:
full_join <- dplyr::full_join(df1, df2, by = "ID")
print(full_join)

No ratings yet
F10 L1 English-IPA
1 page
REPORT
No ratings yet
REPORT
1 page
Favorite Food
No ratings yet
Favorite Food
13 pages
I A and Border Guard Job Advert A
No ratings yet
I A and Border Guard Job Advert A
2 pages
Application For Issuing Original Enrolment Certificate and Id Card PDF
No ratings yet
Application For Issuing Original Enrolment Certificate and Id Card PDF
1 page
Plus 50 IIProduct Specification Sheet
No ratings yet
Plus 50 IIProduct Specification Sheet
4 pages
Strictly Private & Confidential: (Insert Name of The Company)
No ratings yet
Strictly Private & Confidential: (Insert Name of The Company)
5 pages
Rollout Session 4 - Overview of The SmartVA Questionnaire
No ratings yet
Rollout Session 4 - Overview of The SmartVA Questionnaire
25 pages
Unilab History
100% (7)
Unilab History
2 pages
C8HCh5 When People Rebel 1857 After
No ratings yet
C8HCh5 When People Rebel 1857 After
13 pages
Philippine Nut Industry, Inc. vs. Standard Brands, Inc..docfsdfsd
No ratings yet
Philippine Nut Industry, Inc. vs. Standard Brands, Inc..docfsdfsd
2 pages
EN Product-Info Axio-Observer Bio Rel2-0
No ratings yet
EN Product-Info Axio-Observer Bio Rel2-0
24 pages

print("sort the data in increasing order based on subjects")

print("sort the data in increasing order based on rollno")

print("sort the data in increasing order based on names")
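The three sorts labelled above can be sketched with order(), which returns the row positions that put a key column in ascending order. The data frame below is hypothetical; only the column names rollno, names and subjects are taken from the labels above.

```r
# Hypothetical student data; only the column names come from the slide labels.
students <- data.frame(
  rollno   = c(3, 1, 2),
  names    = c("Ravi", "Anu", "Kiran"),
  subjects = c("Python", "R", "Java"),
  stringsAsFactors = FALSE
)

# order() returns the row permutation that sorts the key column,
# so indexing the rows with it yields a sorted copy of the data frame.
by_subjects <- students[order(students$subjects), ]
by_rollno   <- students[order(students$rollno), ]
by_names    <- students[order(students$names), ]

# The same key in descending order.
by_rollno_desc <- students[order(students$rollno, decreasing = TRUE), ]

print(by_rollno)
```

Note that sort() would return only the sorted key values, whereas indexing with order() keeps each row intact.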

# Printing data without duplicates using unique

# Printing data without duplicates using distinct

# remove duplicate rows in subjects column

# remove duplicate rows in names column
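A minimal base R sketch of the duplicate-removal steps named in the comments above, using hypothetical data. unique() compares whole rows, while duplicated() can be applied to a single column; with dplyr installed, distinct(students, subjects, .keep_all = TRUE) is the counterpart of the column-wise version.

```r
# Hypothetical data containing one fully repeated row (rollno 2).
students <- data.frame(
  rollno   = c(1, 2, 2, 3),
  names    = c("Anu", "Kiran", "Kiran", "Ravi"),
  subjects = c("R", "Java", "Java", "R"),
  stringsAsFactors = FALSE
)

# unique() drops rows that repeat an earlier row in every column.
no_dup_rows <- unique(students)

# duplicated() flags repeats within one column; negating it keeps
# the first row seen for each distinct value.
unique_subjects <- students[!duplicated(students$subjects), ]
unique_names    <- students[!duplicated(students$names), ]

print(no_dup_rows)
```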

df$result <- plyr::mapvalues(df$result, from = c("Win","Loss"), to = c("1", "0"))

df$result <- ifelse(df$result=="Win",1,0)
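The recoding can be tried on a small hypothetical data frame. The sketch below writes to new columns rather than overwriting result so that the two approaches can be compared; the column names result_num and result_lookup and the lookup vector codes are assumptions made for illustration.

```r
# Hypothetical match results.
df <- data.frame(
  team   = c("A", "B", "C"),
  result = c("Win", "Loss", "Win"),
  stringsAsFactors = FALSE
)

# ifelse() is vectorized: "Win" becomes 1 and every other non-missing
# value becomes 0, including any misspelled label.
df$result_num <- ifelse(df$result == "Win", 1, 0)

# A named lookup vector maps each label explicitly; a label that is not
# listed would come back as NA instead of silently turning into 0.
codes <- c(Win = 1, Loss = 0)
df$result_lookup <- unname(codes[df$result])

print(df)
```

When mapvalues() is given to = c("1", "0") on a character column, the recoded values stay character strings, so as.numeric() would be needed before doing arithmetic on them.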

df2 <- data.frame(ID = c(2, 3, 4, 5),

1. Inner join (default behavior):

The resulting ‘inner_join’ data frame will only

The resulting ‘left_join’ data frame will

The resulting ‘right_join’ data frame will include

4. Full outer join (‘all = TRUE’)

The resulting ‘full_join’ data frame will include

A cross join, also known as a Cartesian join, results

df2 <- data.frame(ID = c(2, 3, 4, 5),

• Full outer join:
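The join types named in this section can be sketched with base R's merge(). The two data frames below are hypothetical: the ID values of the second one mirror the fragment above, but the table names and the Name and Score columns are assumptions made for illustration.

```r
# Hypothetical tables sharing the key column ID.
students <- data.frame(ID = c(1, 2, 3, 4), Name = c("A", "B", "C", "D"))
scores   <- data.frame(ID = c(2, 3, 4, 5), Score = c(80, 90, 70, 60))

# Inner join (default): only IDs present in both tables.
inner <- merge(students, scores, by = "ID")

# Left join: every row of the first table; unmatched Score becomes NA.
left <- merge(students, scores, by = "ID", all.x = TRUE)

# Right join: every row of the second table; unmatched Name becomes NA.
right <- merge(students, scores, by = "ID", all.y = TRUE)

# Full outer join: every ID from either table.
full <- merge(students, scores, by = "ID", all = TRUE)

# Cross (Cartesian) join: every row of one table paired with every row
# of the other, requested by passing by = NULL.
cross <- merge(students, scores, by = NULL)

print(full)
```

With dplyr, inner_join(), left_join(), right_join() and full_join() express the same four key-based joins.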
