2 Manipulating Processing Data

The document discusses various techniques for manipulating and processing data in R. It covers reshaping data from wide to long formats and vice versa using functions like gather() and spread() from the tidyr package. It also discusses merging datasets using merge(), cbind(), and rbind() functions. Additionally, it covers sorting and ordering data using functions like sort() and order(), as well as transposing data using t(). Finally, it provides an overview of some common data wrangling tools.

Uploaded by

naresh darapu
MANIPULATING AND

PROCESSING DATA IN R

Pavan Kumar A
RESHAPING DATA - NEED
 Reshaping data is a common practice in data analysis, and it can be a very tedious task.
 Data often has multiple levels of grouping and typically requires investigation
at multiple levels.
 For example,
 From a long-term clinical study, we may be interested in investigating relationships over time, or between times, patients, or treatments.
 Performing these investigations fluently requires the data to be reshaped in
different ways, but most software packages make it difficult to generalize
these tasks and code needs to be written for each specific case.
MERGING DATASETS IN R
 Similar datasets obtained from the same data sources often need to be merged together for further processing.
 R provides the following functions for merging datasets:
 The merge() function: Used to merge the data contained in different data frames on the basis of common columns.
 The cbind() function: Used to combine the columns of datasets that have the same number of rows in the same order.
 The rbind() function: Used to combine the rows of datasets that have the same number of columns.
MERGING DATASETS IN R- MERGE()
 The merge() function combines the data of two data frames on the basis of a column that is common to both.
 The merge() function takes the following arguments:
 x: specifies the first data frame
 y: specifies the second data frame
 by, by.x, by.y: specify the names of the columns to match on. Use by when the columns have the same name in x and y; use by.x and by.y when the names differ.
MERGING DATASETS IN R- MERGE()
 Example merge is shown.
 Data Frames mydata1 and mydata2 are merged based on common column “ID”
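The slide's figure is not reproduced in this text, so here is a minimal sketch of the same kind of merge. The contents of mydata1 and mydata2 below are assumptions made up for illustration; only the data frame names and the "ID" key come from the slide.

```r
# Hypothetical data: the column values are assumptions, not the slide's data
mydata1 <- data.frame(ID = c(1, 2, 3),
                      Names = c("Asha", "Ravi", "Meena"))
mydata2 <- data.frame(ID = c(1, 2, 4),
                      English = c(78, 65, 90))

# merge() keeps only the rows whose ID appears in both data frames
merged <- merge(mydata1, mydata2, by = "ID")
print(merged)
```

By default merge() behaves like an inner join, so ID 3 and ID 4 are dropped because each appears in only one of the two data frames.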
MERGING DATASETS IN R- MERGE()
 Example merge is shown.
 Data frames mydata1 and mydata2 are merged on columns that have different names.
Combines mydata1 and mydata2 on the basis of the “ID” and “StudentID” columns, respectively
MERGING DATASETS IN R- MERGE()
 Example merge is shown.
 Data frames mydata1 and mydata2 are merged on two common columns.
Combines the data of mydata1 and mydata2 on the basis of both the “ID” and “Names” columns
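The two variants above (differently named key columns, and two key columns at once) might look like the following sketch. The data values are hypothetical, and mydata3 is a helper name introduced here only so that both data frames share the "ID" and "Names" columns.

```r
# Hypothetical data frames; all values are assumptions for illustration
mydata1 <- data.frame(ID = c(1, 2, 3),
                      Names = c("Asha", "Ravi", "Meena"),
                      Social = c(70, 82, 64))
mydata2 <- data.frame(StudentID = c(1, 2, 3),
                      Names = c("Asha", "Ravi", "Kiran"),
                      Maths = c(88, 91, 75))

# Key columns with different names: ID in mydata1, StudentID in mydata2.
# The non-key "Names" columns clash, so merge() suffixes them .x and .y.
by_diff <- merge(mydata1, mydata2, by.x = "ID", by.y = "StudentID")

# Two key columns: rename StudentID so both frames share ID and Names
mydata3 <- mydata2
names(mydata3)[names(mydata3) == "StudentID"] <- "ID"
by_two <- merge(mydata1, mydata3, by = c("ID", "Names"))

print(by_diff)
print(by_two)
```

With two keys, a row is kept only when both ID and Names agree, so the third row (Meena versus Kiran) is dropped from by_two.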
MERGING DATASETS IN R- CBIND()
 The cbind() function is used to bind the columns of two datasets.

Combines the “ID” and “Names” columns of mydata1 with the “English” and “Maths” columns of mydata2

Combines the “ID”, “Names” and “Social” columns of mydata1 with the “English” and “Maths” columns of mydata2
MERGING DATASETS IN R- RBIND()
 The rbind() function is used to bind the rows of two datasets.
 The rbind() function combines vector, matrix or data frame by rows.
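A short sketch of both binding functions, using made-up data frames (the names more_rows, combined_cols and combined_rows are introduced here for illustration):

```r
# Hypothetical data; values are assumptions
mydata1 <- data.frame(ID = 1:3, Names = c("Asha", "Ravi", "Meena"))
mydata2 <- data.frame(English = c(78, 65, 90), Maths = c(88, 91, 75))

# cbind(): rows are paired purely by position, so both inputs
# must have the same number of rows in the same order
combined_cols <- cbind(mydata1, mydata2)

# rbind(): for data frames, the inputs need the same column names
more_rows <- data.frame(ID = 4:5, Names = c("Kiran", "Divya"))
combined_rows <- rbind(mydata1, more_rows)

print(combined_cols)
print(combined_rows)
```

Unlike merge(), cbind() does no key matching, so it is only safe when the row order of the two datasets is known to agree.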
SORTING DATA
 R provides various functions that allow you to define the order of your data in
a data structure.
 The following functions are used for sorting the data.
 sort(): Used to sort the values contained in a vector
 order(): Returns the positions that put values in sorted order; used to arrange the rows of a dataset by one or more columns
 Example : Sorting and Reverse Sorting of Vector
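The original output is not included in this text; a minimal sketch with an assumed numeric vector x:

```r
# Hypothetical vector for illustration
x <- c(42, 7, 19, 3, 25)

ascending  <- sort(x)                     # 3 7 19 25 42
descending <- sort(x, decreasing = TRUE)  # 42 25 19 7 3

print(ascending)
print(descending)
```

sort() returns a new vector and leaves x itself unchanged.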
ORDERING DATA
 The order() function returns the row positions in sorted order, which are then used to arrange the rows of a dataset by the values of one or more columns.
REVERSE ORDER
 You can reverse the order of the data contained in a column of a data frame in 2 ways.
 By using decreasing=TRUE in the order() function
 By using - (minus) before the column name (for numeric columns)
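Both approaches can be sketched on a small data frame; df and its columns are hypothetical names chosen for this example.

```r
# Hypothetical data frame
df <- data.frame(name  = c("Asha", "Ravi", "Meena", "Kiran"),
                 marks = c(72, 91, 64, 85))

# order() returns row positions, not sorted values
idx <- order(df$marks)                    # 3 1 4 2

ascending <- df[idx, ]

# Two equivalent ways to sort in descending order
desc1 <- df[order(df$marks, decreasing = TRUE), ]
desc2 <- df[order(-df$marks), ]           # minus works for numeric columns

print(desc1)
```

The minus form is handy when sorting by several columns in mixed directions, for example order(df$name, -df$marks).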
TRANSPOSING THE DATA
 You can use the t() function to transpose a matrix or a data frame.
 This function converts rows to columns and columns to rows.
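A small sketch of t() on a matrix and on a data frame (m and df are illustrative objects, not from the slides). Note that transposing a data frame returns a matrix.

```r
m  <- matrix(1:6, nrow = 2, ncol = 3)
tm <- t(m)                                # now 3 rows and 2 columns

df  <- data.frame(a = 1:2, b = 3:4, row.names = c("r1", "r2"))
tdf <- t(df)                              # a matrix with rows "a" and "b"

print(tm)
print(tdf)
```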
DATA WRANGLING TOOLS
DATA WRANGLING TOOLS
 Some of the freely available data wrangling tools are:
 Tabula: Extracts tabular data from PDF files.
 OpenRefine: Tool for working with messy data: cleaning it up and transforming it from one format into another.
 “R” packages: R is an open-source programming/scripting language that is useful for both statistics and data science.
 DataWrangler: An interactive web application for data cleaning and transformation.
 CSVkit: Suite of utilities for converting to and working with CSV files.
 Python: The pandas package for data cleaning.
 Mr. Data Converter: Converts Excel data into one of several web-friendly formats, including HTML, JSON and XML.
DATA WRANGLING TOOLS
 Tabula
 Type: Desktop application
 Technology: Ruby, JavaScript
 License: Open source
 Author: Manuel Aristarán, Mike Tigas and Jeremy B. Merrill
 Links:
 Website: http://tabula.technology/
 A web application that lets you easily extract tabular data/images/text from PDF files.
DATA WRANGLING TOOLS
 OpenRefine
 Type: Desktop application
 Technology: Java
 License: Free
 Author: Google Inc. (United States)
 Links:
 Website: http://code.google.com/p/google-refine/
 Documentation for users: http://code.google.com/p/googlerefine/wiki/DocumentationForUsers
 Documentation for developers: http://code.google.com/p/googlerefine/wiki/DocumentationForDevelopers
 Tutorials: https://github.com/OpenRefine/OpenRefine/wiki/External-Resources
DATA WRANGLING TOOLS
 OpenRefine
 Input formats supported: TSV, CSV, Excel (.xls and .xlsx), JSON, XML and Google Data documents.
 Output formats: TSV, CSV, Excel and HTML table
 Types of data source:
 Upload a file from the local system
 Provide a URL (importing data from tables in web pages or XML documents)
 Copy and paste data
 Provide a link to a Google Docs document
 Features
 Data cleaning, Data transformation, Creation of new fields
DATA WRANGLING TOOLS
 Data Wrangler
 Type: Web application
 Technology: HTML
 License: Free to use
 Author: The Stanford Visualization Group (United States)
 Links:
 Website: http://vis.stanford.edu/wrangler/
 Research: http://vis.stanford.edu/papers/wrangler
 Interactive web application for data transformation and cleaning
 It combines direct manipulation of visualized data with automatic inference of relevant data transformations.
DATA WRANGLING TOOLS
 CSVkit
 Type: Library
 Technology: Python
 License: MIT
 Author: Christopher Groskopf
 Links:
 Repository: https://github.com/onyxfish/csvkit
 Issues: https://github.com/onyxfish/csvkit/issues
 Documentation: http://csvkit.rtfd.org/
 Schemas: https://github.com/onyxfish/ffs
 CSVkit is a suite of utilities for converting to and working with CSV files.
DATA WRANGLING TOOLS
 Features of CSVkit
 Convert Excel to CSV
 Convert JSON to CSV
 csvcut: data scalpel
 csvstat: statistics on the data
 csvgrep: find the data you need
 csvsort: ordering
 csvjoin: merging related data
 csvstack: combining subsets
DATA WRANGLING TOOLS
 Pandas: Python Data Analysis Library
 Type: Library
 Technology: Python
 License: Open source
 Links:
 Website: http://pandas.pydata.org/
 Python with pandas is in use in a wide variety of academic and commercial domains, including Finance, Neuroscience, Economics, Statistics, Web Analytics, and more.
DATA WRANGLING TOOLS
 Features of Pandas:
 Tools for reading and writing data (CSV and text files, Microsoft Excel, SQL databases)
 Merging and joining of data sets
 Flexible reshaping and pivoting of data sets
 A fast and efficient DataFrame object for data manipulation
 Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets
COMPARISON

Source: http://www.analyticsvidhya.com/blog/2015/05/infographic-quick-guide-sas-python/
WHY R FOR DATA WRANGLING
R PACKAGES FOR DATA WRANGLING
 sqldf: R package for running SQL statements on R data frames
 tidyr: Easily makes tidy data with the spread() and gather() functions
 plyr & dplyr: The split-apply-combine strategy for R
 reshape2: For restructuring and aggregating data
 data.table: Speed with large data sets
 stringr: Package for text manipulation
 To use the above packages, install and load them
 Installing:
install.packages("Package_Name")
 Loading:
library(Package_Name)
RESHAPING THE DATA IN R
CONVERTING DATA TO WIDE OR LONG FORMATS
 Wide and Long data
 Wide data typically has more columns and fewer rows
 Long data typically has more rows and fewer columns
 We can convert from one form to the other in R
CONVERTING DATA TO WIDE OR LONG FORMATS
 Wide data has a column for each variable.
 Long-format data has one column for all possible variable types and another column for the values of those variables.
 Long-format data is not necessarily limited to 2 columns; it can have more.
 Some analyses need the long data format and others need the wide format.
 In practice, you need long-format data much more commonly than wide-format data.
 For example
 ggplot2 requires long-format data.
 plyr requires long-format data, and most modelling functions (such as lm() and glm()) require long-format data.
 But people often find it easier to record their data in wide format.
CONVERTING DATA TO WIDE OR LONG FORMATS
 R provides the reshape2 package to convert data from wide to long format and vice-versa.
 We use two functions:
 melt() converts wide data to long format
 dcast() converts long data to wide format
 When converting data from long to wide format or vice-versa, it is important to understand the identifier variables and measured variables.
 Identifier variables identify the observations
 Measured variables represent the observed measurements
MELTING DATA TO LONG FORMAT
 The melt() function is used for converting data from wide format to long format.
 The melt() function is contained in the reshape2 package.
 So, the reshape2 package should be installed and loaded.
 A sample example is shown here.
Wide format:
id  time  x1  x2
1   1     5   6
1   2     3   5
2   1     6   1
2   2     2   4

melt() converts the wide format to the long format; cast() converts it back.

Long format:
id  time  variable  value
1   1     x1        5
1   2     x1        3
2   1     x1        6
2   2     x1        2
1   1     x2        6
1   2     x2        5
2   1     x2        1
2   2     x2        4
MELTING DATA TO LONG FORMAT
 Example 1 : The melt() function.
 We have considered the “airquality” dataset
 Using the default settings of the melt function
 Command is melt(AQsample)
 By default, melt has assumed that all columns with numeric values are variables with values
MELTING DATA TO LONG FORMAT
 Example 1: The melt() function
 Applying some more arguments
 Based on Identifier variables, the
whole dataset is reshaped.
 Here, id.vars are “Month” and “Day”
and the remaining variables are
treated as measure.vars.
 Data is never lost while reshaping
MELTING DATA TO LONG FORMAT
 Example 1: The melt() function.
 If you want to change the default names of “variable” and “value”, the following command is used.
 Syntax
melt(data, id.vars, measure.vars, variable.name = "variable", na.rm = FALSE, value.name = "value")
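As a sketch of this syntax on the built-in airquality data: AQsample is the name used on the earlier slide, but its definition as the first six rows is an assumption here, as are the replacement column names climate_variable and climate_value. The reshape2 package must be installed.

```r
library(reshape2)

# Assumption: AQsample is a small slice of the built-in airquality data
AQsample <- head(airquality)

# Month and Day identify each observation; every other column is melted.
# variable.name / value.name replace the default "variable" / "value".
aq_long <- melt(AQsample,
                id.vars = c("Month", "Day"),
                variable.name = "climate_variable",
                value.name = "climate_value")

head(aq_long)
```

Each of the six input rows contributes one output row per measured column (Ozone, Solar.R, Wind, Temp), giving 24 rows in the long result.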
MELTING DATA TO LONG FORMAT
 Example 2 : The melt() function
1. Here, a data frame named dataMelt is created.
2. Here, the melt() function explicitly specifies the ID variables, source columns, destination columns and the measurement column
MELTING DATA TO LONG FORMAT
 The melt() function, by default, treats all categorical variables as identifier variables.
 We can also change the default settings
Here, we are applying additional parameters to specify the identifiers, measurement variable and value names
CASTING DATA TO WIDE FORMAT
 In reshape2 there are multiple cast functions.
 Since you will most commonly work with data.frame objects,
the dcast() function is used here.
 There is also acast() to return a vector, matrix, or array.
 The dcast() function uses a formula to describe the shape of the data.
 The arguments on the left side of the formula refer to the “id.vars” and the arguments on the right side of the formula refer to the “measure.vars”.
 Here, we are using the long data format of the airquality dataset
CASTING DATA TO WIDE FORMAT
 Example 1: The dcast() function
 Dataset used : Long data format of Airquality dataset.
 Here, “Month” and “Day” are again the id.vars, and the remaining “variable” column supplies the measure.vars.
CASTING DATA TO WIDE FORMAT
 Example 1: The dcast() function
 Check with the following formula
month ~ variable
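A hedged sketch of both formulas on the melted airquality data. The object names aq_long, aq_wide and aq_month are introduced here, and using mean as the aggregation function for the month-only formula is an assumption: when the formula leaves out an identifier such as Day, several values fall into each cell and dcast() needs to be told how to combine them.

```r
library(reshape2)

# Long format of a small slice of airquality (default column names)
aq_long <- melt(head(airquality), id.vars = c("Month", "Day"))

# Both identifiers on the left: recovers one row per Month/Day pair
aq_wide <- dcast(aq_long, Month + Day ~ variable)

# Only Month on the left: values must be aggregated, here with mean()
aq_month <- dcast(aq_long, Month ~ variable,
                  fun.aggregate = mean, na.rm = TRUE)

print(aq_wide)
print(aq_month)
```

Without fun.aggregate, a formula that does not uniquely identify each cell makes dcast() fall back to counting values, which is rarely what is wanted.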
CASTING DATA TO WIDE FORMAT
 Example 2 : Sample dataset
 Formula here is Subject + Sex ~ Condition
 The id.vars are Subject and Sex
 The measure.vars is Condition
THE TIDYR PACKAGE
 tidyr is a package that makes it easy to “tidy” your data
 Main Features (Functions)
 Gather and Spread
 Unite and Separate
 To install
install.packages("tidyr")
 To load
library("tidyr")
 Help
help(package = "tidyr")
THE TIDYR PACKAGE
 The gather() function
 The gather() function takes multiple columns and collapses them into key-value pairs, duplicating all other columns as needed.
 The gather() function can be used when the column names are values of a variable rather than variables themselves.
 Example :
 Dataset used is TB data: the number of TB cases in 3 different countries
 Here, there are 3 rows and 4 columns.
 Column names [2:4] are Year values.
 So, we can apply gather() to collect these columns under one column (for example: Year)
THE TIDYR PACKAGE
 The gather() function
 Syntax
gather(data, key, value, ..., na.rm = FALSE)
 Example : The gather() function
 The following command is used to convert the data.
gather(cases, "Year", "n", 2:4)
 cases : Dataset name
 Year : Key
 n : Value
 2:4 : Specification of columns (the values from the 2nd column to the 4th column should be gathered)
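The TB table itself is not reproduced in this text, so the sketch below builds a stand-in. The country codes and counts are made-up assumptions; only the shape (3 rows, 4 columns, with year values as column names) follows the slide. In newer tidyr releases gather() still works but pivot_longer() is the recommended replacement.

```r
library(tidyr)

# Hypothetical TB counts; the numbers are invented for illustration
cases <- data.frame(country = c("FR", "DE", "US"),
                    `2011` = c(7000, 5800, 15000),
                    `2012` = c(6900, 6000, 14000),
                    `2013` = c(7000, 6200, 13000),
                    check.names = FALSE)

# Collapse columns 2 to 4 into a Year (key) column and an n (value) column
long_cases <- gather(cases, "Year", "n", 2:4)

print(long_cases)
```

The three year columns become nine rows, one per country and year, and the old column names end up as character values in Year.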
THE TIDYR PACKAGE
 The spread() function
 The spread() function spreads a key-value pair across multiple columns.
 Dataset used here is the pollution data, which has 6 rows and 3 columns.
 We can spread the values (amount) into two different columns (for example: large and small)
THE TIDYR PACKAGE
 The spread() function
 Syntax
spread(data, key, value)
 Example : The spread() function
 Command is as follows
spread(pollution, size, amount)
 pollution : data
 size : key
 amount : value
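A sketch with a stand-in pollution table of the stated shape (6 rows, 3 columns). The city names and amounts are assumptions for illustration. In newer tidyr releases pivot_wider() is the recommended replacement for spread().

```r
library(tidyr)

# Hypothetical pollution data: 6 rows, 3 columns
pollution <- data.frame(
  city   = rep(c("New York", "London", "Beijing"), each = 2),
  size   = rep(c("large", "small"), times = 3),
  amount = c(23, 14, 22, 16, 121, 56),
  stringsAsFactors = FALSE
)

# Each distinct value of size becomes its own column filled from amount
wide_pollution <- spread(pollution, size, amount)

print(wide_pollution)
```

spread() is the inverse of gather(): six long rows collapse to one row per city with a large and a small column.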
THE TIDYR PACKAGE
 The unite() function
 It is a convenience function to paste together multiple columns into one.
 Syntax
unite(data, col, ..., sep = "_")
 Example
unite(storms2, "date", year, month, day, sep = "-")
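A minimal sketch of unite(), followed by its counterpart separate() to show that the operation can be reversed. The contents of storms2 are assumptions; only the column names year, month and day come from the command above.

```r
library(tidyr)

# Hypothetical storms2 data with the date split over three columns
storms2 <- data.frame(storm = c("Alberto", "Alex"),
                      year  = c("2000", "1998"),
                      month = c("08", "07"),
                      day   = c("03", "27"),
                      stringsAsFactors = FALSE)

# Paste year, month and day into a single "date" column
storms <- unite(storms2, "date", year, month, day, sep = "-")

# separate() splits the column again on the same separator
storms_split <- separate(storms, date, c("year", "month", "day"), sep = "-")

print(storms)
```

The source columns are removed by default after uniting; the round trip through separate() restores them as character columns.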
THE TIDYR PACKAGE
 The separate() function
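 It turns a single character column into multiple columns.
 Syntax
separate(data, col, into, sep = "_")
 The sep argument is the separator between columns, such as "_", ":" or ";", and is interpreted as a regular expression.
 Example
separate(storms, date, c("year", "month", "day"), sep = "-")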
THE SQLDF PACKAGE
 Many business users have previously worked with relational database management systems (RDBMS).
 In R, there is a package called “sqldf” for running SQL statements and data manipulation on R data frames
 To install
install.packages("sqldf")
 To load :
library(sqldf)
THE SQLDF PACKAGE
 Performing joins is very common in SQL.
 Left join : Returns all records from the left table, along with matching records from the right table.
 Right join : Returns all records from the right table, along with matching records from the left table.
 Inner join : Returns only the records that match in both tables.
 Full outer join : Returns all rows from both tables, whether or not they match.
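The four join types can be illustrated without a database. The sketch below deliberately uses base R merge() rather than sqldf, so it shows the join semantics only, not the sqldf syntax; df1 and df2 are hypothetical tables.

```r
# Hypothetical tables sharing an "id" key
df1 <- data.frame(id = c(1, 2, 3), name = c("a", "b", "c"))
df2 <- data.frame(id = c(2, 3, 4), score = c(10, 20, 30))

inner <- merge(df1, df2, by = "id")                 # ids 2, 3
left  <- merge(df1, df2, by = "id", all.x = TRUE)   # ids 1, 2, 3
right <- merge(df1, df2, by = "id", all.y = TRUE)   # ids 2, 3, 4
full  <- merge(df1, df2, by = "id", all = TRUE)     # ids 1, 2, 3, 4

print(full)
```

Rows without a partner in the other table are kept by the outer joins and filled with NA in the columns that come from the missing side.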
THE SQLDF PACKAGE
 Example 1: The select statement
 The following two datasets are used
THE SQLDF PACKAGE
 Example : sqldf package
 Performing Inner Join
 Performing Inner Join with a where clause in it
 Subsetting the data
sqldf("select id from df1")
THE DPLYR PACKAGE
 A package that transforms tabular data.
 Functions in dplyr package
 select()
 filter()
 mutate()
 arrange()
 group_by() and
 summarise()
 Data set used is storms data
THE DPLYR PACKAGE
 Example : The select() function
 The select() function keeps only the variables you mention.
 Syntax
select(data, ...)
 The commands used for the following output
select(storms, storm, pressure)
select(storms, -storm)
THE DPLYR PACKAGE
 Example : The filter() function
 The filter() function returns rows with matching conditions.
 Syntax
filter(data, ...)
 Command
filter(storms, wind >= 50)
filter(storms, wind >= 50, storm %in% c("Alberto", "Alex", "Allison"))
THE DPLYR PACKAGE
 Example : The mutate() function
 The mutate() function derives new variables from existing variables.
 Syntax
mutate(data, ...)
 Command
mutate(storms, ratio = pressure / wind)
THE DPLYR PACKAGE
 Example : The arrange() function
 The arrange() function arranges rows by variables.
 Syntax
arrange(data, ...)
 Command
arrange(storms, wind)
arrange(storms, desc(wind))
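The four verbs covered so far (select, filter, mutate, arrange) can be chained with the pipe operator. This is a sketch on a made-up storms table: the values are assumptions and only the column names storm, wind and pressure come from the slides.

```r
library(dplyr)

# Hypothetical storms data; values are illustrative only
storms <- data.frame(storm    = c("Alberto", "Alex", "Allison", "Ana"),
                     wind     = c(110, 45, 65, 40),
                     pressure = c(1007, 1009, 1005, 1013),
                     stringsAsFactors = FALSE)

result <- storms %>%
  filter(wind >= 50) %>%                 # keep the stronger storms
  mutate(ratio = pressure / wind) %>%    # derive a new variable
  arrange(desc(wind)) %>%                # strongest first
  select(storm, wind, ratio)             # keep only these columns

print(result)
```

Each verb takes a data frame and returns a new one, which is what makes this step-by-step chaining possible.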
THE DPLYR PACKAGE
 Example : The group_by() function
 The group_by() function groups a table by one or more variables.
 The group_by() function takes an existing table and converts it into a grouped table where operations are performed "by group".
 Syntax
group_by(data, ...)
 Command
pollution %>% group_by(city)
THE DPLYR PACKAGE
 Example : The summarise() function.
 The summarise() function summarises multiple values to a single value.
 Syntax
summarise(data, ...)
 data : Data frame or table
 … : Name-value pairs of summary functions like min(), mean(), max() etc.
 Applying various summary functions on pollution data
 Command
pollution %>% summarise(median = median(amount), variance = var(amount))
THE DPLYR PACKAGE
 Applying various summary functions on pollution data
pollution %>% summarise(mean = mean(amount), sum = sum(amount), n = n())
pollution %>% group_by(city) %>% summarise(mean = mean(amount), sum = sum(amount), n = n())
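To see what these two commands return, here is a sketch on a stand-in pollution table. The city names and amounts are assumptions; the column names city and amount follow the commands above.

```r
library(dplyr)

# Hypothetical pollution data
pollution <- data.frame(
  city   = rep(c("New York", "London", "Beijing"), each = 2),
  size   = rep(c("large", "small"), times = 3),
  amount = c(23, 14, 22, 16, 121, 56),
  stringsAsFactors = FALSE
)

# Ungrouped: one summary row for the whole table
overall <- pollution %>%
  summarise(mean = mean(amount), sum = sum(amount), n = n())

# Grouped: one summary row per city
by_city <- pollution %>%
  group_by(city) %>%
  summarise(mean = mean(amount), sum = sum(amount), n = n())

print(overall)
print(by_city)
```

The only difference between the two calls is group_by(), which changes summarise() from a whole-table summary to a per-group summary.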
THE DPLYR PACKAGE
 Example : The bind functions
 The bind functions efficiently bind multiple data frames by row and column.
 There are two functions: bind_cols() and bind_rows()
 bind_cols() efficiently binds multiple data frames by columns
 bind_rows() efficiently binds multiple data frames by rows
 Syntax of bind_cols
bind_cols(x, ...)
 Syntax of bind_rows
bind_rows(x, ...)
THE DPLYR PACKAGE
 Example : The bind() functions
 Commands for bind_rows() and bind_cols()
bind_cols(y, z)
bind_rows(y, z)
THE DPLYR PACKAGE
 Example : Set Operations
 There are four functions under Set Operations in the dplyr package
 The intersect() function
 The union() function
 The setdiff() function
 The setequal() function
 Syntax
intersect(x, y, ...)
union(x, y, ...)
setdiff(x, y, ...)
setequal(x, y, ...)
 Commands
intersect(y, z)
union(y, z)
setdiff(y, z)
THE DPLYR PACKAGE
 Example : The join operations
 Types of joins in the dplyr package along with the syntax

 inner_join(x, y, by = NULL)
 left_join(x, y, by = NULL)
 right_join(x, y, by = NULL)
 full_join(x, y, by = NULL)
THE DPLYR PACKAGE
 Example 1 : Left join
left_join(songs, artists, by = "name")
 Example 2 : Left join
left_join(songs2, artists2, by = c("first", "last"))
THE DPLYR PACKAGE
 Example : The inner join
inner_join(songs, artists, by = "name")
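The difference between the left and inner joins above can be sketched as follows. The contents of songs and artists are hypothetical; only the table names and the "name" key come from the slides.

```r
library(dplyr)

# Hypothetical tables keyed on "name"
songs <- data.frame(
  song = c("Across the Universe", "Come Together", "Hello, Goodbye", "Peggy Sue"),
  name = c("John", "John", "Paul", "Buddy"),
  stringsAsFactors = FALSE
)
artists <- data.frame(
  name  = c("George", "John", "Paul", "Ringo"),
  plays = c("sitar", "guitar", "bass", "drums"),
  stringsAsFactors = FALSE
)

# Left join keeps every song; unmatched artists give NA in plays
left <- left_join(songs, artists, by = "name")

# Inner join keeps only songs whose artist is in the artists table
inner <- inner_join(songs, artists, by = "name")

print(left)
print(inner)
```

"Peggy Sue" survives the left join with a missing plays value but is dropped by the inner join, because "Buddy" has no row in artists.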
TRANSFORMATIONS
TRANSFORMATIONS : REASSIGNING VARIABLE
 Reassigning Variables:
 It’s also possible to make other changes to data frames.
 For example, suppose that we wanted to define a new column, a midpoint variable that is the mean of the high and low prices.
 We can add this variable with the same notation:
> dow30$mid <- (dow30$High + dow30$Low)/2
> names(dow30)
[1] "symbol" "Date" "Open" "High" "Low"
[6] "Close" "Volume" "Adj.Close" "mid"
TRANSFORMATIONS
 The transform() function : Function used for changing or adding variables in a data frame
 Syntax:
transform(data, ...)
 To use transform, you specify a data frame (as the first argument) and a set of expressions that use variables within the data frame.
 The transform function applies each expression to the data frame and then returns the final data frame.
> dow30.transformed <- transform(dow30, Date=as.Date(Date), mid = (High + Low)/2)
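The dow30 data set is not included in this text, so the sketch below uses a tiny stand-in with invented prices to show what the command does. Only the column names are taken from the slides.

```r
# Hypothetical two-row stand-in for dow30; prices are invented
dow30 <- data.frame(symbol = c("MMM", "AA"),
                    Date   = c("2009-09-21", "2009-09-21"),
                    High   = c(75.0, 14.5),
                    Low    = c(73.0, 13.9),
                    stringsAsFactors = FALSE)

# Convert Date from text to a Date object and add a midpoint column.
# Column names can be used directly inside transform().
dow30.transformed <- transform(dow30,
                               Date = as.Date(Date),
                               mid  = (High + Low) / 2)

str(dow30.transformed)
```

transform() returns a modified copy, so dow30 itself keeps its original columns unless the result is assigned back to it.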
APPLYING A FUNCTION TO EACH ELEMENT OF AN OBJECT
 Transforming data means applying a common function to a set of objects and returning a new set of transformed objects.
 The base R library includes a set of different functions for doing this.
 Applying a function to an array or matrix
 To apply a function to parts of an array (or matrix), use the apply function:
apply(X, MARGIN, FUN, ...)
 X is the array (or matrix) to which the function is applied
 MARGIN specifies the dimensions of the array over which the function is applied (1 for rows, 2 for columns)
 FUN is the function that is applied
APPLYING A FUNCTION TO AN ARRAY

 Sample example for applying a function to an array or matrix
Here, we have created a matrix called “x” with 5 rows and 4 columns
Now let's show how apply works.
We will use the function max to get the highest numbers in the matrix
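The slide's matrix is not shown in this text; a sketch with an assumed 5 by 4 matrix filled with 1 to 20:

```r
# Assumed contents: 1..20 filled column by column
x <- matrix(1:20, nrow = 5, ncol = 4)

row_max <- apply(x, 1, max)   # MARGIN = 1: one maximum per row
col_max <- apply(x, 2, max)   # MARGIN = 2: one maximum per column

print(row_max)   # 16 17 18 19 20
print(col_max)   # 5 10 15 20
```

Because the matrix is filled column-wise, the last column holds the largest value of every row and the last row holds the largest value of every column.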
APPLYING A FUNCTION TO LIST OR VECTOR
 To apply a function to each element in a vector or a list and return a list, you can use the function lapply
 Syntax
lapply(X, FUN, ...)
 The function lapply requires two arguments:
 X : Name of the list or vector
 FUN : Name of the function to be applied to the list or vector
 You may specify additional arguments that will be passed to FUN.


APPLYING A FUNCTION TO LIST OR VECTOR
 Simple example of how to use lapply
 Let's create a list of 5 elements and apply a function to the list.
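A minimal sketch with an assumed five-element list; the choice of sqrt and the helper argument p are illustrative.

```r
# Hypothetical list of 5 elements
x <- list(1, 4, 9, 16, 25)

roots <- lapply(x, sqrt)                          # always returns a list

# Extra arguments after FUN are passed on to it
powers <- lapply(x, function(v, p) v^p, p = 2)

# sapply() is the variant that simplifies the result to a vector
roots_vec <- sapply(x, sqrt)

print(roots_vec)
```

lapply() is the safer choice in programs because its return type is always a list, while sapply() is convenient interactively.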
APPLYING A FUNCTION TO A DATA FRAME
 You can apply a function to a data frame, and the function will be applied to
each vector in the data frame.
 Example:
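The slide's example is not reproduced here. Since a data frame is a list of column vectors, a sketch with a hypothetical data frame d could be:

```r
# Hypothetical data frame
d <- data.frame(x = c(1, 2, 3), y = c(10, 20, 30))

col_ranges <- lapply(d, range)   # list with one element per column
col_means  <- sapply(d, mean)    # named numeric vector: x = 2, y = 20

print(col_means)
```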
BINNING DATA
 Another common data transformation is to group a set of observations into bins (groups) based on the value of a specific variable.
 For example
1. Suppose that you had some time series data where time was measured in days, but you wanted to summarize the data by month.
 There are several functions available for binning numeric data in R.
BINNING DATA- CUT
 In many data analysis settings, it might be useful to break up a continuous variable such as age into a categorical variable.
 Or, you might want to classify a categorical variable like year into a larger bin, such as 1990-2000.
 The cut function in R makes this task simple!
BINNING DATA- CUT
 The function cut is useful for taking a continuous variable and splitting it into discrete pieces
 Here is the default form of cut for use with numeric vectors:
# numeric form
cut(x, breaks)
 There is also a version of cut for manipulating Date objects:
# Date form
cut(x, breaks, start.on.monday = TRUE)
 The cut function takes a numeric vector as input and returns a factor
BINNING DATA- CUT
 Example for cut()
 Let's create a hypothetical clinical data set here
BINNING DATA- CUT
 We will apply the cut command to the clinical.trial data frame to make age a factor (categorical value).
 Let's see the structure of the data frame
 Applying cut() on clinical.trial$age (# numeric form)
BINNING DATA- CUT
 Applying cut() on clinical.trial$year.enroll (# factor)
 Here, the year.enroll column is categorical data (a factor), so we have to convert it to numeric data before applying the cut() command
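Both uses of cut() can be sketched on a small stand-in for clinical.trial. The patient values and the break points are assumptions made for this example; only the column names age and year.enroll come from the slides.

```r
# Hypothetical clinical data; year.enroll is stored as a factor
clinical.trial <- data.frame(
  patient     = 1:6,
  age         = c(23, 37, 45, 52, 68, 71),
  year.enroll = factor(c(1987, 1992, 1995, 1999, 1988, 1997))
)

# Numeric form: bin age into (20,40], (40,60], (60,80]
age.cat <- cut(clinical.trial$age, breaks = c(20, 40, 60, 80))
table(age.cat)

# A factor must go through as.character() first; as.numeric() alone
# would return the internal level codes instead of the years
year.num <- as.numeric(as.character(clinical.trial$year.enroll))
year.cat <- cut(year.num, breaks = c(1985, 1990, 1995, 2000), dig.lab = 4)
table(year.cat)
```

Intervals are closed on the right by default, so an age of exactly 40 would fall in the first bin and the year 1995 falls in the middle bin.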
DATA CLEANING
 Some data sets contain placeholder values like 997, 998, and 999 that are not actual values, and there might also be duplicate records in the data.
 Finding and Removing Duplicates
 Data sources often contain duplicate values.
 It’s a good idea to check for duplicates in your data
 R provides some useful functions for detecting duplicate values.
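The slides do not name the functions, so the following is a sketch using base R's duplicated() and unique(). The records table is hypothetical, and treating 997, 998 and 999 as missing-value codes is an assumption that depends on how a given data set was coded.

```r
# Hypothetical records with one exact duplicate and one placeholder code
records <- data.frame(id    = c(1, 2, 2, 3, 3),
                      value = c(10, 20, 20, 30, 999))

# duplicated() flags rows that repeat an earlier row
dup_flags <- duplicated(records)    # FALSE FALSE TRUE FALSE FALSE

# unique() drops them; records[!duplicated(records), ] is equivalent
deduped <- unique(records)

# Assumption: 997, 998 and 999 are placeholder codes, so recode them as NA
deduped$value[deduped$value %in% c(997, 998, 999)] <- NA

print(deduped)
```

Only rows that match in every column count as duplicates here; the two rows with id 3 differ in value, so both are kept.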
THANK YOU !!!
