2 Manipulating Processing Data
PROCESSING DATA IN R
Pavan Kumar A
RESHAPING DATA - NEED
Reshaping data is a common step in data analysis, and it can be a very tedious
task.
Data often has multiple levels of grouping and typically requires investigation
at multiple levels.
For example,
A web application that lets you easily extract tabular data/images/text from PDF files.
DATA WRANGLING TOOLS
Open Refine
Type: Desktop application
Technology: Java
License: Free
Author: Google Inc. (United States)
Links:
Website: http://code.google.com/p/google-refine/
http://code.google.com/p/googlerefine/wiki/DocumentationForDevelopers
Tutorials
https://github.com/OpenRefine/OpenRefine/wiki/External-Resources
DATA WRANGLING TOOLS
Open Refine
Input formats supported: TSV, CSV, Excel (.xls and .xlsx), JSON, XML and Google
Data documents.
Output Formats: TSV, CSV, Excel and in table
Types of Data source:
Upload a file from local system
Can provide URL (importing data from tables in web pages, in XML documents)
Features
Data cleaning, Data transformation, Creation of new fields
DATA WRANGLING TOOLS
Data Wrangler
Type: Web application
Technology: HTML
License: Free to use
Author: The Stanford Visualization Group (United States)
Links:
Website: http://vis.stanford.edu/wrangler/
Research: http://vis.stanford.edu/papers/wrangler
Interactive web application for transformation and cleaning
It combines direct manipulation of visualized data with automatic inference of relevant data
transformation.
DATA WRANGLING TOOLS
CSVkit
Type: Library
Technology: Python
License: MIT
Author: Christopher Groskopf
Links:
Repository: https://github.com/onyxfish/csvkit
Issues: https://github.com/onyxfish/csvkit/issues
Documentation: http://csvkit.rtfd.org/
Schemas: https://github.com/onyxfish/ffs
Python with pandas is in use in a wide variety of academic and commercial domains,
including Finance, Neuroscience, Economics, Statistics, Web Analytics, and more.
DATA WRANGLING TOOLS
Features of Pandas:
Tools for reading and writing data (CSV and text files, Microsoft Excel, SQL databases).
Merging and joining of data sets.
Flexible reshaping and pivoting of data sets.
A fast and efficient DataFrame object for data manipulation.
Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets.
COMPARISON
Source: http://www.analyticsvidhya.com/blog/2015/05/infographic-quick-guide-sas-python/
WHY R FOR DATA WRANGLING
R PACKAGES FOR DATA WRANGLING
sqldf: an R package for running SQL statements on R data frames
tidyr: makes tidy data easy to create with the spread() and gather() functions
In practice, long-format data is needed much more often than wide-format
data.
For example
By default, melt() assumes that all columns with numeric values are
variables with values.
MELTING DATA TO LONG FORMAT
Example 1: The melt() function
Applying some more arguments
The whole dataset is reshaped based on the identifier variables.
Here, id.vars are “Month” and “Day” and the remaining variables are treated as measure.vars.
Data is never lost while reshaping.
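A minimal sketch of this melt, assuming the reshape2 package is installed and using the airquality data set that ships with R. The object name aql is only an illustrative choice.

```r
library(reshape2)

# airquality ships with R: Ozone, Solar.R, Wind, Temp, Month, Day
aql <- melt(airquality, id.vars = c("Month", "Day"))

# One row per (Month, Day, variable) combination
head(aql)

# Every original cell is still present: 153 rows x 4 measured columns
nrow(aql) == nrow(airquality) * 4
```

Missing values are kept as NA rows unless na.rm = TRUE is passed, which is why the row count matches exactly.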
MELTING DATA TO LONG FORMAT
Example 1: The melt() function.
To change the default column names “variable” and “value”, pass the
variable.name and value.name arguments to melt().
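As a hedged illustration, the call below renames the two default columns. The names climate_variable and climate_value are assumptions chosen for this sketch, not names taken from the slides.

```r
library(reshape2)

# Rename the default "variable" and "value" columns while melting
aql <- melt(airquality,
            id.vars       = c("Month", "Day"),
            variable.name = "climate_variable",  # hypothetical name
            value.name    = "climate_value")     # hypothetical name

names(aql)
```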
Syntax
In the formula, the variables on the left side are the “id.vars”, and the
variable on the right side is the column that holds the names of the “measure.vars”.
Here, we are using the long data format of the airquality dataset.
CASTING DATA TO WIDE FORMAT
Example 1: The dcast() function
Dataset used : Long data format of Airquality dataset.
Here, “Month” and “Day” are again the id.vars, and the
remaining variables are the measure.vars.
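The cast back to wide format could look like the sketch below, assuming reshape2 is installed. The long data frame is rebuilt first so the block runs on its own; aql and aqw are illustrative names.

```r
library(reshape2)

# Long format: Month, Day, variable, value
aql <- melt(airquality, id.vars = c("Month", "Day"))

# Wide format again: one column per level of "variable"
aqw <- dcast(aql, Month + Day ~ variable, value.var = "value")

head(aqw)
dim(aqw)
```

The id columns come first in the result, so the column order differs from the original airquality data frame even though the values are the same.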
THE TIDYR PACKAGE
The unite() function
It combines multiple columns into a single column.
Example
unite(storms2, "date", year, month, day, sep = "-")
THE TIDYR PACKAGE
The separate() function
It turns a single character column into
multiple columns.
Syntax
Example
separate(storms, date, c("year", "month", "day"), sep = "-")
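The two tidyr calls can be tried end to end with the sketch below. The storms data frame here is a small hypothetical stand-in, since the data set used on the slides is not reproduced in the text.

```r
library(tidyr)

# Hypothetical stand-in for the storms data used on the slides
storms <- data.frame(
  storm = c("Alberto", "Alex", "Allison"),
  wind  = c(110, 45, 65),
  date  = c("2000-08-12", "1998-07-30", "1995-06-04"),
  stringsAsFactors = FALSE
)

# Split one character column into three
storms2 <- separate(storms, date, c("year", "month", "day"), sep = "-")

# Paste the three columns back into one
storms3 <- unite(storms2, "date", year, month, day, sep = "-")

storms2
storms3
```

Note that separate() leaves the new columns as character vectors by default, so year, month and day are strings here.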
THE SQLDF PACKAGE
Many business users have previously worked with relational databases (RDBMS).
In R, the “sqldf” package runs SQL statements on data frames for data
manipulation.
To install:
install.packages("sqldf")
To load:
library(sqldf)
THE SQLDF PACKAGE
Joins are a common operation in SQL.
Left join: returns all records from the left table and the matching records from the right table.
Right join: returns all records from the right table and the matching records from the left table.
Inner join: returns only the records that match in both tables.
Full outer join: returns all rows from both tables, whether or not they match.
THE SQLDF PACKAGE
Example 1: SELECT statement
The following two datasets are used
THE SQLDF PACKAGE
Example : sqldf package
Performing Inner Join
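A minimal sketch of an inner join with sqldf, assuming the package is installed. The employees and departments tables are hypothetical, since the two data sets from the slides are not reproduced in the text.

```r
library(sqldf)

# Hypothetical example tables (not the data sets from the slides)
employees <- data.frame(
  emp_name = c("Asha", "Ben", "Chen", "Dev"),
  dept_id  = c(1, 2, 2, 4)
)
departments <- data.frame(
  dept_id   = c(1, 2, 3),
  dept_name = c("Finance", "Sales", "Research")
)

# Data frames are referenced by name inside the SQL string
inner <- sqldf("
  SELECT e.emp_name, d.dept_name
  FROM employees e
  INNER JOIN departments d ON e.dept_id = d.dept_id
")

inner
```

Only employees whose dept_id also appears in departments are returned, so the row for dept_id 4 is dropped.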
THE DPLYR PACKAGE
Main functions:
select()
filter()
mutate()
arrange()
group_by()
summarise()
The data set used is the storms data.
THE DPLYR PACKAGE
Example : The select() function
The select() function keeps only the variables you
mention.
Syntax
select(data, ...)
Commands
select(storms, storm, pressure)
select(storms, -storm)
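Both commands can be run against a small data frame as sketched below. The storms values are hypothetical stand-ins for the slide data, and dplyr is assumed to be installed.

```r
library(dplyr)

# Hypothetical stand-in for the storms data used on the slides
storms <- data.frame(
  storm    = c("Alberto", "Alex", "Allison"),
  wind     = c(110, 45, 65),
  pressure = c(1007, 1009, 1005),
  stringsAsFactors = FALSE
)

kept    <- select(storms, storm, pressure)  # keep two columns
dropped <- select(storms, -storm)           # drop one column

names(kept)
names(dropped)
```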
THE DPLYR PACKAGE
Example : The filter() function
The filter() function returns rows with matching
conditions.
Syntax
filter(data, ...)
Commands
filter(storms, wind >= 50)
filter(storms, wind >= 50, storm %in% c("Alberto", "Alex", "Allison"))
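The sketch below runs both filters on a hypothetical storms data frame (the values are assumptions for illustration). Conditions separated by commas are combined with a logical AND.

```r
library(dplyr)

# Hypothetical stand-in for the storms data used on the slides
storms <- data.frame(
  storm = c("Alberto", "Alex", "Allison", "Ana", "Arlene", "Arthur"),
  wind  = c(110, 45, 65, 40, 50, 45),
  stringsAsFactors = FALSE
)

# Single condition
strong <- filter(storms, wind >= 50)

# Two conditions: both must hold for a row to be kept
strong_named <- filter(storms, wind >= 50,
                       storm %in% c("Alberto", "Alex", "Allison"))

strong
strong_named
```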
THE DPLYR PACKAGE
Example : The mutate() function
The mutate() function derives new variables from existing variables.
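As an illustration, the sketch below adds a ratio column to a hypothetical storms data frame. The column name ratio and the data values are assumptions made for this example.

```r
library(dplyr)

# Hypothetical stand-in for the storms data used on the slides
storms <- data.frame(
  storm    = c("Alberto", "Alex", "Allison"),
  wind     = c(110, 45, 65),
  pressure = c(1007, 1009, 1005),
  stringsAsFactors = FALSE
)

# Derive a new column from two existing ones; the input is not modified
storms_ratio <- mutate(storms, ratio = pressure / wind)

storms_ratio
```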
The group_by() function takes an existing table and converts it into a grouped
table where operations are performed "by group".
Syntax
group_by(data, ...)
Command
pollution %>% group_by(city)
THE DPLYR PACKAGE
Example : The summarise() function.
The summarise() function summarises multiple values to a single value.
Syntax
summarise(data, ...)
Command
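One way group_by() and summarise() combine is sketched below. The pollution data frame and its values are hypothetical, as are the summary column names mean, sum and n.

```r
library(dplyr)

# Hypothetical pollution measurements per city and particle size
pollution <- data.frame(
  city   = c("New York", "New York", "London", "London", "Beijing", "Beijing"),
  size   = c("large", "small", "large", "small", "large", "small"),
  amount = c(23, 14, 22, 16, 121, 56),
  stringsAsFactors = FALSE
)

# One summary row per city
by_city <- pollution %>%
  group_by(city) %>%
  summarise(mean = mean(amount), sum = sum(amount), n = n())

by_city
```

Without group_by(), the same summarise() call would collapse the whole table to a single row.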
Syntax of bind_cols
bind_cols(x, ...)
Syntax of bind_rows
bind_rows(x, ...)
THE DPLYR PACKAGE
Example : The bind() functions
Commands for bind_rows() and bind_cols()
bind_cols(y, z)
bind_rows(y, z)
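The two bind functions can be compared on a pair of small data frames, as in the sketch below. The contents of y and z are hypothetical, since the slide data is not reproduced in the text.

```r
library(dplyr)

# Hypothetical data frames with the same columns
y <- data.frame(x1 = c("A", "B", "C"), x2 = 1:3, stringsAsFactors = FALSE)
z <- data.frame(x1 = c("B", "C", "D"), x2 = 2:4, stringsAsFactors = FALSE)

# Stack z underneath y (columns are matched by name)
stacked <- bind_rows(y, z)

# Place z next to y (rows are matched by position, so row counts must agree);
# recent dplyr versions rename the duplicated column names to keep them unique
side_by_side <- bind_cols(y, z)

dim(stacked)
dim(side_by_side)
```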
THE DPLYR PACKAGE
intersect(y, z)
inner_join(x, y, by = NULL)
left_join(x, y, by = NULL)
right_join(x, y, by = NULL)
full_join(x, y, by = NULL)
THE DPLYR PACKAGE
Example 1: Left join
left_join(songs, artists, by = "name")
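The sketch below contrasts the join types on two hypothetical tables. The songs and artists contents are assumptions made for illustration, and dplyr is assumed to be installed.

```r
library(dplyr)

# Hypothetical example tables sharing the key column "name"
songs <- data.frame(
  song = c("Across the Universe", "Come Together", "Hello Goodbye", "Peggy Sue"),
  name = c("John", "John", "Paul", "Buddy"),
  stringsAsFactors = FALSE
)
artists <- data.frame(
  name  = c("George", "John", "Paul", "Ringo"),
  plays = c("sitar", "guitar", "bass", "drums"),
  stringsAsFactors = FALSE
)

left  <- left_join(songs, artists, by = "name")   # all songs, NA where no artist
inner <- inner_join(songs, artists, by = "name")  # only songs with a known artist
full  <- full_join(songs, artists, by = "name")   # everything from both tables

left
nrow(inner)
nrow(full)
```

In the left join, "Peggy Sue" is kept with plays set to NA because "Buddy" has no row in artists.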
To use transform, you specify a data frame (as the first argument) and a set of
expressions that use variables within the data frame.
The transform function applies each expression to the data frame and then
returns the final data frame.
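A short base R sketch of transform() follows. The scores data frame and the derived column names total and math_share are hypothetical.

```r
# Hypothetical data frame of test scores
scores <- data.frame(
  student = c("A", "B", "C"),
  math    = c(80, 65, 90),
  reading = c(70, 85, 60)
)

# Each expression can use the existing columns by name
scores2 <- transform(scores,
                     total      = math + reading,
                     math_share = math / (math + reading))

scores2
```

Expressions in a single transform() call see only the original columns, so math_share repeats the sum rather than referring to the new total column.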
1. Suppose that you had some time series data where time was
measured in days, but you wanted to summarize the data by
month.
# numeric form
cut(x, breaks)
The cut function takes a numeric vector as input and returns a factor
BINNING DATA- CUT
Example for cut()
Let's create a hypothetical clinical data set here.
BINNING DATA- CUT
We will apply cut() to the clinical.trail data frame to make age a
factor (categorical value).
Let's see the structure of the data frame.
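A minimal sketch of this binning step in base R. The data frame is named clinical.trail to match the slide, but its columns, values, the break points and the new column name age.group are all assumptions made for illustration.

```r
# Hypothetical clinical data (columns and values are illustrative only)
clinical.trail <- data.frame(
  patient = 1:8,
  age     = c(23, 35, 41, 47, 52, 58, 64, 71)
)

# Bin age into three intervals; by default each interval is closed on the right
clinical.trail$age.group <- cut(clinical.trail$age, breaks = c(20, 40, 60, 80))

# age.group is now a factor with one level per interval
str(clinical.trail)
table(clinical.trail$age.group)
```

Passing labels = c("young", "middle", "senior") (hypothetical labels) to cut() would replace the interval notation with readable level names.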