2 Manipulating Processing Data

The document discusses various techniques for manipulating and processing data in R. It covers reshaping data from wide to long formats and vice versa using functions like gather() and spread() from the tidyr package. It also discusses merging datasets using merge(), cbind(), and rbind() functions. Additionally, it covers sorting and ordering data using functions like sort() and order(), as well as transposing data using t(). Finally, it provides an overview of some common data wrangling tools.

MANIPULATING AND PROCESSING DATA IN R

Pavan Kumar A
RESHAPING DATA - NEED
• Reshaping data is a routine part of data analysis, and it is often tedious.
• Data often has multiple levels of grouping and typically needs to be investigated at several of those levels.
• For example, in a long-term clinical study we may want to investigate relationships over time, or between time points, patients, or treatments.
• Doing this fluently requires the data to be reshaped in different ways, but most software packages make it difficult to generalize these tasks, so code has to be written for each specific case.
MERGING DATASETS IN R
• Similar datasets obtained from the same data sources often need to be merged for further processing.
• R provides the following functions for merging datasets:
  - merge(): merges data frames on the basis of common columns.
  - cbind(): adds the columns of one dataset to another; the datasets must have the same number of rows, in the same order, because rows are matched by position.
  - rbind(): adds the rows of one dataset to another; the datasets must have the same number of columns (and, for data frames, the same column names).
MERGING DATASETS IN R- MERGE()
• The merge() function combines two data frames on the basis of one or more columns common to both.
• The main arguments of merge() are:
  - x: the first data frame
  - y: the second data frame
  - by, by.x, by.y: the names of the columns to merge on. Use by when the key columns have the same name in x and y; use by.x and by.y when the names differ.
MERGING DATASETS IN R- MERGE()
• Example: data frames mydata1 and mydata2 are merged on the common column "ID".
MERGING DATASETS IN R- MERGE()
• Example: data frames mydata1 and mydata2 are merged on columns with different names.
• The merge combines mydata1 and mydata2 on the basis of the "ID" and "StudentID" columns respectively.
MERGING DATASETS IN R- MERGE()
• Example: data frames mydata1 and mydata2 are merged on two common columns.
• The merge combines mydata1 and mydata2 on the basis of both the "ID" and "Names" columns.
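The three merge() cases above can be sketched as follows. The contents of mydata1 and mydata2 are not shown in the slides, so the data frames here (including mydata3 and mydata4) are hypothetical stand-ins.

```r
# Hypothetical data: the slides' mydata1/mydata2 contents are not shown.
mydata1 <- data.frame(ID = c(1, 2, 3),
                      Names = c("Asha", "Ben", "Chen"),
                      Social = c(70, 65, 80))
mydata2 <- data.frame(ID = c(1, 2, 4),
                      English = c(88, 75, 60),
                      Maths = c(92, 81, 55))

# 1. Merge on a common column: only IDs present in both (1 and 2) are kept.
merged_common <- merge(mydata1, mydata2, by = "ID")

# 2. Merge on differently named key columns (ID in x, StudentID in y).
mydata3 <- data.frame(StudentID = c(1, 2, 4), English = c(88, 75, 60))
merged_diff <- merge(mydata1, mydata3, by.x = "ID", by.y = "StudentID")

# 3. Merge on two common columns at once.
mydata4 <- data.frame(ID = c(1, 2, 3),
                      Names = c("Asha", "Ben", "Zed"),
                      Maths = c(92, 81, 55))
merged_two <- merge(mydata1, mydata4, by = c("ID", "Names"))

# all = TRUE keeps unmatched rows from both sides and fills the gaps with NA.
merged_all <- merge(mydata1, mydata2, by = "ID", all = TRUE)
print(merged_all)
```

By default merge() behaves like an inner join; the all, all.x and all.y arguments switch it to a full, left or right outer join.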
MERGING DATASETS IN R- CBIND()
• The cbind() function binds the columns of two datasets side by side.
• Example 1: combines the "ID" and "Names" columns of mydata1 with the "English" and "Maths" columns of mydata2.
• Example 2: combines the "ID", "Names" and "Social" columns of mydata1 with the "English" and "Maths" columns of mydata2.
MERGING DATASETS IN R- RBIND()
• The rbind() function binds the rows of two datasets.
• rbind() combines vectors, matrices or data frames by rows.
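A minimal sketch of cbind() and rbind() on data frames. All values, and the extra data frame new_students, are hypothetical and only illustrate the shape requirements.

```r
# Hypothetical data with the column names used in the slides.
mydata1 <- data.frame(ID = 1:3,
                      Names = c("Asha", "Ben", "Chen"),
                      Social = c(70, 65, 80))
mydata2 <- data.frame(English = c(88, 75, 60),
                      Maths = c(92, 81, 55))

# cbind(): rows are matched by position, so both inputs need 3 rows.
by_cols <- cbind(mydata1[, c("ID", "Names")], mydata2[, c("English", "Maths")])

# rbind(): the new rows must have the same column names as mydata1.
new_students <- data.frame(ID = 4:5,
                           Names = c("Dev", "Eli"),
                           Social = c(90, 72))
by_rows <- rbind(mydata1, new_students)

print(by_cols)
print(by_rows)
```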
SORTING DATA
• R provides functions that let you define the order of the data in a data structure.
• The following functions are used for sorting:
  - sort(): sorts the values of a vector.
  - order(): returns the positions that would sort a vector; these positions are used to arrange the rows of a dataset by one or more columns.
• Example: sorting and reverse sorting of a vector.
ORDERING DATA
• The order() function is used to arrange the rows of a dataset by the values of one or more columns.
REVERSE ORDER
• You can reverse the order of the data in a column of a data frame in two ways:
  - By passing decreasing = TRUE to the order() function.
  - By putting a - (minus) before the column name; this works for numeric columns.
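The sorting and ordering steps above might look like this in practice. The vector marks and the data frame students are hypothetical.

```r
# Hypothetical marks for four students.
marks <- c(72, 55, 91, 60)

ascending  <- sort(marks)                      # 55 60 72 91
descending <- sort(marks, decreasing = TRUE)   # 91 72 60 55

students <- data.frame(name = c("Asha", "Ben", "Chen", "Dev"),
                       marks = marks)

# order() returns row positions; indexing with them sorts the data frame.
asc   <- students[order(students$marks), ]
desc1 <- students[order(students$marks, decreasing = TRUE), ]
desc2 <- students[order(-students$marks), ]    # minus sign: numeric columns only

print(asc)
print(desc1)
```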
TRANSPOSING THE DATA
• You can use the t() function to transpose a matrix or a data frame.
• It converts rows to columns and columns to rows. When given a data frame, t() returns a matrix.
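A short illustration of t() on a matrix and on a data frame; the objects m and df are made up for the example.

```r
# A 2 x 3 matrix becomes 3 x 2 after transposing.
m <- matrix(1:6, nrow = 2)
tm <- t(m)

# Transposing a data frame converts it to a matrix first,
# so the result is a matrix whose row names are the old column names.
df <- data.frame(x = 1:2, y = c(3.5, 4.5))
tdf <- t(df)

print(tm)
print(tdf)
```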
DATA WRANGLING TOOLS
• Some of the freely available data wrangling tools are:
  - Tabula: extracts tabular data from PDF files.
  - OpenRefine: a tool for working with messy data, cleaning it up, and transforming it from one format into another.
  - R packages: R is an open-source programming and scripting language that is useful for both statistics and data science.
  - Data Wrangler: an interactive web application for data cleaning and transformation.
  - CSVkit: a suite of utilities for converting to and working with CSV files.
  - Python: the pandas package for data cleaning.
  - Mr. Data Converter: converts Excel data into one of several web-friendly formats, including HTML, JSON and XML.
DATA WRANGLING TOOLS
• Tabula
  - Type: Desktop application
  - Technology: Ruby, JavaScript
  - License: Open source
  - Author: Manuel Aristarán, Mike Tigas and Jeremy B. Merrill
  - Website: http://tabula.technology/
• An application, used through a web browser, that lets you easily extract tabular data from PDF files.
DATA WRANGLING TOOLS
• OpenRefine
  - Type: Desktop application
  - Technology: Java
  - License: Free
  - Author: Google Inc. (United States)
  - Website: http://code.google.com/p/google-refine/
  - Documentation for users: http://code.google.com/p/googlerefine/wiki/DocumentationForUsers
  - Documentation for developers: http://code.google.com/p/googlerefine/wiki/DocumentationForDevelopers
  - Tutorials: https://github.com/OpenRefine/OpenRefine/wiki/External-Resources
DATA WRANGLING TOOLS
• OpenRefine
  - Input formats supported: TSV, CSV, Excel (.xls and .xlsx), JSON, XML and Google Data documents.
  - Output formats: TSV, CSV, Excel and HTML table.
  - Types of data source:
    - Upload a file from the local system
    - Provide a URL (importing data from tables in web pages or in XML documents)
    - Copy and paste data
    - Provide a link to a Google Docs document
  - Features: data cleaning, data transformation, creation of new fields
DATA WRANGLING TOOLS
• Data Wrangler
  - Type: Web application
  - Technology: HTML
  - License: Free to use
  - Author: The Stanford Visualization Group (United States)
  - Website: http://vis.stanford.edu/wrangler/
  - Research: http://vis.stanford.edu/papers/wrangler
• An interactive web application for data transformation and cleaning.
• It combines direct manipulation of visualized data with automatic inference of relevant data transformations.
DATA WRANGLING TOOLS
• CSVkit
  - Type: Library
  - Technology: Python
  - License: MIT
  - Author: Christopher Groskopf
  - Repository: https://github.com/onyxfish/csvkit
  - Issues: https://github.com/onyxfish/csvkit/issues
  - Documentation: http://csvkit.rtfd.org/
  - Schemas: https://github.com/onyxfish/ffs
• CSVkit is a suite of utilities for converting to and working with CSV.

DATA WRANGLING TOOLS
• Features of CSVkit
  - Convert Excel to CSV
  - Convert JSON to CSV
  - csvcut: data scalpel
  - csvstat: statistics on the data
  - csvgrep: find the data you need
  - csvsort: ordering
  - csvjoin: merging related data
  - csvstack: combining subsets
DATA WRANGLING TOOLS
• Pandas: Python Data Analysis Library
  - Type: Library
  - Technology: Python
  - License: Open source
  - Website: http://pandas.pydata.org/
• Python with pandas is in use in a wide variety of academic and commercial domains, including finance, neuroscience, economics, statistics, web analytics, and more.
DATA WRANGLING TOOLS
• Features of pandas:
  - Tools for reading and writing data (CSV and text files, Microsoft Excel, SQL databases).
  - Merging and joining of datasets.
  - Flexible reshaping and pivoting of datasets.
  - A fast and efficient DataFrame object for data manipulation.
  - Aggregating or transforming data with a powerful group-by engine that allows split-apply-combine operations on datasets.
COMPARISON

Source: http://www.analyticsvidhya.com/blog/2015/05/infographic-quick-guide-sas-python/
WHY R FOR DATA WRANGLING
R PACKAGES FOR DATA WRANGLING
• sqldf: runs SQL statements on R data frames.
• tidyr: makes data tidy with the spread() and gather() functions.
• plyr and dplyr: the split-apply-combine strategy for R.
• reshape2: restructures and aggregates data.
• data.table: speed with large datasets.
• stringr: text manipulation.
• To use these packages, install and load them:
  - Installing: install.packages("Package_Name")
  - Loading: library(Package_Name)
RESHAPING THE DATA IN R
CONVERTING DATA TO WIDE OR LONG FORMATS
• Wide and long data
  - Wide data usually has more columns and fewer rows.
  - Long data usually has more rows and fewer columns.
• We can convert from one form to the other in R.
CONVERTING DATA TO WIDE OR LONG FORMATS
• Wide data has a separate column for each variable.
• Long-format data has one column that holds the variable names and another column that holds the values of those variables.
  - Long data is not necessarily just two columns; identifier columns add more.
• Some analyses need the long format and others need the wide format.
• In practice, you need long-format data much more often than wide-format data. For example:
  - ggplot2 requires long-format data.
  - plyr requires long-format data, and most modelling functions (such as lm() and glm()) require long-format data.
• But people often find it easier to record their data in wide format.
CONVERTING DATA TO WIDE OR LONG FORMATS
• The reshape2 package converts data from wide to long format and vice versa.
• We use two of its functions:
  - melt() converts wide data to long format.
  - dcast() converts long data to wide format.
• When converting data from long to wide format or vice versa, it is important to distinguish identifier variables from measured variables:
  - Identifier variables identify the observations.
  - Measured variables hold the observed measurements.
MELTING DATA TO LONG FORMAT
• The melt() function converts data from wide format to long format.
• melt() is provided by the reshape2 package, so reshape2 must be installed and loaded.
• A small example: melt() turns the wide table into the long table, and cast() reverses the step.

Wide format:

id  time  x1  x2
1   1     5   6
1   2     3   5
2   1     6   1
2   2     2   4

Long format:

id  time  variable  value
1   1     x1        5
1   2     x1        3
2   1     x1        6
2   2     x1        2
1   1     x2        6
1   2     x2        5
2   1     x2        1
2   2     x2        4
MELTING DATA TO LONG FORMAT
• Example 1: the melt() function.
  - We use the "airquality" dataset.
  - With the default settings of melt(), the command is melt(AQsample).
• By default, melt() has assumed that all columns with numeric values are measured variables.
MELTING DATA TO LONG FORMAT
• Example 1: the melt() function, with more arguments.
  - The whole dataset is reshaped around the identifier variables.
  - Here, id.vars are "Month" and "Day", and the remaining variables are treated as measure.vars.
  - No data is lost when melting.
MELTING DATA TO LONG FORMAT
• Example 1: the melt() function.
  - To change the default column names "variable" and "value", use the variable.name and value.name arguments.
• Syntax:
    melt(data, id.vars, measure.vars, variable.name = "variable",
         na.rm = FALSE, value.name = "value")
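A minimal sketch of melt() with explicit identifier variables and custom output names. It assumes the reshape2 package is installed; AQsample is taken here to be the first rows of the built-in airquality dataset, and the names climate_variable and climate_value are illustrative choices, not the slides' own.

```r
library(reshape2)

# Assumption: AQsample is a small sample of the built-in airquality data.
AQsample <- head(airquality)

# Month and Day identify each observation; the other four columns
# (Ozone, Solar.R, Wind, Temp) are melted into name/value pairs.
aq_long <- melt(AQsample,
                id.vars = c("Month", "Day"),
                variable.name = "climate_variable",
                value.name = "climate_value")

print(head(aq_long))
```

Each of the 6 sample rows contributes one long-format row per measured variable, so the result has 6 x 4 = 24 rows.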
MELTING DATA TO LONG FORMAT
• Example 2: the melt() function
  1. A data frame named dataMelt is created.
  2. The melt() call explicitly specifies the ID variables, the source columns, the destination columns and the measurement column.
MELTING DATA TO LONG FORMAT
• By default, melt() treats all categorical variables as identifier variables.
• We can also change the default settings: here, additional parameters specify the identifiers, the measurement variable name and the value name.
CASTING DATA TO WIDE FORMAT
• reshape2 has several cast functions.
• Since you will most commonly work with data.frame objects, the dcast() function is used here.
• There is also acast(), which returns a vector, matrix, or array.
• dcast() uses a formula to describe the shape of the data.
• The variables on the left side of the formula are the identifier variables (id.vars); the variable on the right side is the one whose values become the new column names (the measured variables).
• Here, we use the long format of the airquality dataset.
CASTING DATA TO WIDE FORMAT
• Example 1: the dcast() function
  - Dataset used: the long format of the airquality dataset.
  - Here, "Month" and "Day" are kept as id.vars, and the remaining "variable" column supplies the measure.vars that are spread back into columns.
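CASTING DATA TO WIDE FORMAT
• Example 1: the dcast() function
• Try the following formula:
    month ~ variable
• Because each month contains many days, this formula does not identify a single value per cell, so dcast() has to aggregate; without an aggregation function it falls back to counting the values.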
CASTING DATA TO WIDE FORMAT
• Example 2: sample dataset
  - The formula here is Subject + Sex ~ Condition.
  - The id.vars are Subject and Sex.
  - The values of Condition become the new columns (the measure.vars).
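The casting step can be sketched on the full airquality dataset, assuming reshape2 is installed. The object names aq_long, aq_wide and monthly_mean are illustrative.

```r
library(reshape2)

# Long format: one row per (Month, Day, variable) combination.
aq_long <- melt(airquality, id.vars = c("Month", "Day"))

# Month + Day uniquely identify a row, so casting restores the wide layout.
aq_wide <- dcast(aq_long, Month + Day ~ variable, value.var = "value")

# With Month alone on the left, several values fall into each cell,
# so an aggregation function is supplied explicitly.
monthly_mean <- dcast(aq_long, Month ~ variable,
                      value.var = "value",
                      fun.aggregate = mean, na.rm = TRUE)

print(head(aq_wide))
print(monthly_mean)
```

Passing value.var explicitly avoids the message dcast() prints when it has to guess which column holds the values.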
THE TIDYR PACKAGE
• tidyr is a package that makes it easy to "tidy" your data.
• Main functions:
  - gather() and spread()
  - unite() and separate()
• To install:
    install.packages("tidyr")
• To load:
    library(tidyr)
• Help:
    help(package = "tidyr")
THE TIDYR PACKAGE
• The gather() function
  - gather() takes multiple columns and collapses them into key-value pairs, duplicating all other columns as needed.
  - Use gather() when the column names are not names of variables but values of a variable.
• Example:
  - The dataset is TB data: the number of TB cases in 3 different countries.
  - It has 3 rows and 4 columns.
  - The names of columns 2 to 4 are year values.
  - So we can apply gather() to collect these columns under one column (for example, Year).
THE TIDYR PACKAGE
• The gather() function
• Syntax:
    gather(data, key, value, ..., na.rm = FALSE)
• Example: the following command converts the data.
    gather(cases, "Year", "n", 2:4)
  - cases: the dataset name
  - Year: the key
  - n: the value
  - 2:4: the columns to gather (the 2nd through the 4th)
THE TIDYR PACKAGE
• The spread() function
  - spread() spreads a key-value pair across multiple columns.
  - The dataset used here is the pollution data, which has 6 rows and 3 columns.
  - We can spread the values (amount) into two different columns (for example, large and small).
THE TIDYR PACKAGE
• The spread() function
• Syntax:
    spread(data, key, value)
• Example: the command is as follows.
    spread(pollution, size, amount)
  - pollution: the data
  - size: the key
  - amount: the value
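A round trip with gather() and spread(), assuming tidyr is installed. The counts in cases are invented for illustration; only the shape (3 countries, 3 year columns) follows the slides. Both functions are still available in tidyr, although newer releases recommend pivot_longer() and pivot_wider() for the same tasks.

```r
library(tidyr)

# Hypothetical TB counts: one row per country, one column per year.
cases <- data.frame(country = c("FR", "DE", "US"),
                    `2011` = c(7000, 5800, 15000),
                    `2012` = c(6900, 6000, 14000),
                    `2013` = c(7000, 6200, 13000),
                    check.names = FALSE)

# Collapse columns 2 to 4 into a key column (Year) and a value column (n).
long <- gather(cases, "Year", "n", 2:4)

# Spread the key-value pair back into one column per year.
wide <- spread(long, Year, n)

print(long)
print(wide)
```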
THE TIDYR PACKAGE
• The unite() function
  - A convenience function that pastes multiple columns together into one.
• Syntax:
    unite(data, col, ..., sep = "_")
• Example:
    unite(storms2, "date", year, month, day, sep = "-")
THE TIDYR PACKAGE
• The separate() function
  - It turns a single character column into multiple columns.
• Syntax:
    separate(data, col, into, sep = "_")
  - sep can be a separator such as "_", ":" or ";", or a regular expression.
• Example:
    separate(storms, date, c("year", "month", "day"), sep = "-")
THE SQLDF PACKAGE
• Many business users have previously worked with relational databases (RDBMS) and already know SQL.
• In R, the sqldf package runs SQL statements on data frames for data manipulation.
• To install:
    install.packages("sqldf")
• To load:
    library(sqldf)
THE SQLDF PACKAGE
• Joins are among the most common operations in SQL.
  - Left join: returns all records from the left table, with the matching records from the right table.
  - Right join: returns all records from the right table, with the matching records from the left table.
  - Inner join: returns only the records that match in both tables.
  - Full outer join: returns all rows from both tables, whether or not they match.
THE SQLDF PACKAGE
• Example 1: the SQL select statement
• The following two datasets are used.
THE SQLDF PACKAGE
• Example: the sqldf package
  - Performing an inner join
  - Performing an inner join with a where clause
  - Subsetting the data:
      sqldf("select id from df1")
THE DPLYR PACKAGE
• A package that transforms tabular data.
• Functions in the dplyr package:
  - select()
  - filter()
  - mutate()
  - arrange()
  - group_by()
  - summarise()
• The dataset used is the storms data.
THE DPLYR PACKAGE
• Example: the select() function
• select() keeps only the variables you mention.
• Syntax:
    select(data, ...)
• Commands used for the output:
    select(storms, storm, pressure)
    select(storms, -storm)
THE DPLYR PACKAGE
• Example: the filter() function
• filter() returns the rows that match the given conditions.
• Syntax:
    filter(data, ...)
• Commands:
    filter(storms, wind >= 50)
    filter(storms, wind >= 50,
           storm %in% c("Alberto", "Alex", "Allison"))
THE DPLYR PACKAGE
• Example: the mutate() function
• mutate() derives new variables from existing variables.
• Syntax:
    mutate(data, ...)
• Command:
    mutate(storms, ratio = pressure / wind)
THE DPLYR PACKAGE
• Example: the arrange() function
• arrange() orders rows by the values of one or more variables.
• Syntax:
    arrange(data, ...)
• Commands:
    arrange(storms, wind)
    arrange(storms, desc(wind))
THE DPLYR PACKAGE
• Example: the group_by() function
• group_by() groups a table by one or more variables.
• It takes an existing table and converts it into a grouped table on which operations are performed "by group".
• Syntax:
    group_by(data, ...)
• Command:
    pollution %>% group_by(city)
THE DPLYR PACKAGE
• Example: the summarise() function
• summarise() reduces multiple values to a single value.
• Syntax:
    summarise(data, ...)
  - data: a data frame or table
  - ...: name-value pairs of summary functions such as min(), mean() and max()
• Applying summary functions to the pollution data:
    pollution %>% summarise(median = median(amount),
                            variance = var(amount))
THE DPLYR PACKAGE
• Applying various summary functions to the pollution data:
    pollution %>% summarise(mean = mean(amount), sum = sum(amount), n = n())

    pollution %>% group_by(city) %>%
      summarise(mean = mean(amount), sum = sum(amount), n = n())
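The dplyr verbs above can be chained on a small table. This sketch assumes dplyr is installed; the pollution values are hypothetical stand-ins that match the shape described earlier (6 rows, 3 columns).

```r
library(dplyr)

# Hypothetical pollution data: particle amount by city and particle size.
pollution <- data.frame(
  city   = rep(c("New York", "London", "Beijing"), each = 2),
  size   = rep(c("large", "small"), times = 3),
  amount = c(23, 14, 22, 16, 121, 56)
)

# One summary row for the whole table.
overall <- pollution %>%
  summarise(mean = mean(amount), sum = sum(amount), n = n())

# One summary row per city.
by_city <- pollution %>%
  group_by(city) %>%
  summarise(mean = mean(amount), sum = sum(amount), n = n())

# filter(), mutate(), select() and arrange() combined in one pipeline.
high <- pollution %>%
  filter(amount >= 20) %>%
  mutate(share = amount / sum(amount)) %>%
  select(city, size, share) %>%
  arrange(desc(share))

print(by_city)
print(high)
```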
THE DPLYR PACKAGE
• Example: the bind functions
• The bind functions efficiently bind multiple data frames by row and by column.
• There are two of them, bind_cols() and bind_rows():
  - bind_cols() efficiently binds multiple data frames by columns.
  - bind_rows() efficiently binds multiple data frames by rows.
• Syntax of bind_cols:
    bind_cols(x, ...)
• Syntax of bind_rows:
    bind_rows(x, ...)
THE DPLYR PACKAGE
• Example: the bind functions
• Commands for bind_cols() and bind_rows():
    bind_cols(y, z)
    bind_rows(y, z)
THE DPLYR PACKAGE
• Example: set operations
• There are four set operation functions in the dplyr package:
  - The intersect() function
  - The union() function
  - The setdiff() function
  - The setequal() function
• Syntax:
    intersect(x, y, ...)
    union(x, y, ...)
    setdiff(x, y, ...)
    setequal(x, y, ...)
• Commands:
    intersect(y, z)
    union(y, z)
    setdiff(y, z)
THE DPLYR PACKAGE
• Example: the join operations
• Types of joins in the dplyr package, with their syntax:
  - inner_join(x, y, by = NULL)
  - left_join(x, y, by = NULL)
  - right_join(x, y, by = NULL)
  - full_join(x, y, by = NULL)
THE DPLYR PACKAGE
• Example 1: left join
    left_join(songs, artists, by = "name")
• Example 2: left join on two columns
    left_join(songs2, artists2, by = c("first", "last"))
THE DPLYR PACKAGE
• Example: the inner join
    inner_join(songs, artists, by = "name")
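To make the difference between the join types concrete, here is a sketch with two small tables, assuming dplyr is installed. The contents of songs and artists are hypothetical; only the table names and the key column "name" come from the commands above.

```r
library(dplyr)

# Hypothetical tables that share the key column "name".
songs <- data.frame(
  song = c("Across the Universe", "Come Together", "Hello, Goodbye", "Peggy Sue"),
  name = c("John", "John", "Paul", "Buddy")
)
artists <- data.frame(
  name  = c("George", "John", "Paul", "Ringo"),
  plays = c("sitar", "guitar", "bass", "drums")
)

# Left join: every song is kept; Buddy has no match, so plays is NA.
left <- left_join(songs, artists, by = "name")

# Inner join: only songs whose artist appears in both tables.
inner <- inner_join(songs, artists, by = "name")

# Full join: all rows from both tables, matched where possible.
full <- full_join(songs, artists, by = "name")

print(left)
print(inner)
```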
TRANSFORMATIONS
TRANSFORMATIONS : REASSIGNING VARIABLE
• Reassigning variables:
  - It is also possible to make other changes to data frames.
  - For example, suppose we want to define a new column, mid, that is the mean of the high and low prices.
  - We can add this variable with the $ notation:

> dow30$mid <- (dow30$High + dow30$Low)/2
> names(dow30)
[1] "symbol" "Date" "Open" "High" "Low"
[6] "Close" "Volume" "Adj.Close" "mid"
TRANSFORMATIONS
• The transform() function adds new variables to a data frame or modifies existing ones.
• Syntax:
    transform(data, ...)
• To use transform(), you specify a data frame (as the first argument) and a set of expressions that use variables within the data frame.
• transform() applies each expression to the data frame and then returns the final data frame.

> dow30.transformed <- transform(dow30,
      Date = as.Date(Date), mid = (High + Low)/2)
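A self-contained version of the two approaches above. The real dow30 data is not included in the slides, so the two rows below are hypothetical and carry only the columns the example needs.

```r
# Hypothetical rows standing in for the dow30 data frame.
dow30 <- data.frame(symbol = c("MMM", "AA"),
                    Date = c("2009-09-21", "2009-09-21"),
                    High = c(73.5, 14.2),
                    Low = c(72.1, 13.6),
                    stringsAsFactors = FALSE)

# Approach 1: assign a new column with the $ notation.
dow30$mid <- (dow30$High + dow30$Low) / 2

# Approach 2: transform() evaluates the expressions inside the data frame,
# here converting Date to a Date object and recomputing mid in one call.
dow30.transformed <- transform(dow30,
                               Date = as.Date(Date),
                               mid = (High + Low) / 2)

str(dow30.transformed)
```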
APPLYING A FUNCTION TO EACH ELEMENT OF AN OBJECT
• Transforming data means applying a common function to a set of objects and returning a new set of transformed objects.
• Base R includes a set of functions for doing this.
• Applying a function to an array or matrix
  - To apply a function to parts of an array (or matrix), use the apply function:
      apply(X, MARGIN, FUN, ...)
  - X is the array (or matrix) to which the function is applied.
  - MARGIN gives the dimensions over which the function is applied: 1 for rows, 2 for columns.
  - FUN is the function that is applied.
APPLYING A FUNCTION TO AN ARRAY
• A sample example of applying a function to an array or matrix:
  - We create a matrix called "x" with 5 rows and 4 columns.
  - To show how apply works, we use the function max to get the highest numbers in the matrix.
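One way the matrix example might look. The values 1 to 20 are an assumption, since the slide's matrix is not shown.

```r
# A 5 x 4 matrix filled column by column with 1..20 (hypothetical values).
x <- matrix(1:20, nrow = 5, ncol = 4)

# MARGIN = 1: apply max to each row.
row_max <- apply(x, 1, max)

# MARGIN = 2: apply max to each column.
col_max <- apply(x, 2, max)

# MARGIN = c(1, 2): apply the function to every single element.
doubled <- apply(x, c(1, 2), function(v) v * 2)

print(row_max)   # 16 17 18 19 20
print(col_max)   # 5 10 15 20
```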
APPLYING A FUNCTION TO LIST OR VECTOR
• To apply a function to each element of a vector or a list and return a list, use the function lapply.
• Syntax:
    lapply(X, FUN, ...)
• lapply requires two arguments:
  - X: the list or vector
  - FUN: the function to be applied to each element of the list or vector
• You may specify additional arguments, which are passed on to FUN.

APPLYING A FUNCTION TO LIST OR VECTOR
• A simple example of how to use lapply:
  - Let's create a list of 5 elements and apply a function to it.
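A sketch of that example with a hypothetical five-element list, including an extra argument that lapply passes through to the applied function.

```r
# A list of 5 elements (hypothetical values).
values <- list(a = 1:5,
               b = c(2.5, 3.5),
               c = 10,
               d = c(4, 8, 12),
               e = c(1, NA, 3))

# Apply mean() to every element; the result is a list of the same length.
means <- lapply(values, mean)

# Extra arguments after the function are passed on to it,
# here na.rm = TRUE so that the NA in element e is ignored.
means_no_na <- lapply(values, mean, na.rm = TRUE)

print(means_no_na)
```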
APPLYING A FUNCTION TO A DATA FRAME
• You can also pass a data frame to lapply; the function is then applied to each column (vector) of the data frame.
• Example:
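A minimal sketch with a hypothetical two-column data frame d. sapply() is shown alongside lapply() because it simplifies the result to a named vector.

```r
# Hypothetical data frame with two numeric columns.
d <- data.frame(x = 1:5, y = c(2, 4, 6, 8, 10))

# lapply() treats the data frame as a list of columns and returns a list.
col_means_list <- lapply(d, mean)

# sapply() does the same but simplifies the result to a named vector.
col_means_vec <- sapply(d, mean)

print(col_means_list)
print(col_means_vec)
```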
BINNING DATA
• Another common data transformation is to group a set of observations into bins (groups) based on the value of a specific variable.
• For example, suppose you had time series data where time was measured in days, but you wanted to summarize the data by month.
• There are several functions available for binning numeric data in R.
BINNING DATA- CUT
• In many data analysis settings, it is useful to break up a continuous variable such as age into a categorical variable.
• Or you might want to group a variable like year into larger bins, such as 1990-2000.
• The cut function in R makes this task simple.
BINNING DATA- CUT
• The function cut is useful for taking a continuous variable and splitting it into discrete pieces.
• Here is the default form of cut for use with numeric vectors:

    # numeric form
    cut(x, breaks)

• There is also a version of cut for manipulating Date objects:

    # Date form
    cut(x, breaks, start.on.monday = TRUE)

• The cut function takes a numeric vector as input and returns a factor.
BINNING DATA- CUT
• Example for cut()
• Let's create a hypothetical clinical dataset.
BINNING DATA- CUT
• We apply cut() to the clinical.trial data frame to make age a factor (categorical variable).
• Let's look at the structure of the data frame.
• Applying cut() to clinical.trial$age (numeric form)

BINNING DATA- CUT
• Applying cut() to clinical.trial$year.enroll (a factor)
• Here, the year.enroll column is categorical data, so we first convert it to numeric data and then apply cut().
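The cut() steps could be sketched as follows. The six patients, the break points and the bin labels are all hypothetical; only the column names age and year.enroll come from the slides.

```r
# Hypothetical clinical data set.
clinical.trial <- data.frame(
  patient = 1:6,
  age = c(34, 47, 52, 61, 68, 75),
  year.enroll = factor(c(1987, 1992, 1995, 2001, 2004, 2009))
)

# Numeric form: bin age into the intervals (30,50], (50,70] and (70,90].
age.group <- cut(clinical.trial$age, breaks = c(30, 50, 70, 90))
print(table(age.group))

# year.enroll is a factor, so convert it through character to numeric first;
# as.numeric() on the factor alone would return the internal level codes.
year.num <- as.numeric(as.character(clinical.trial$year.enroll))
year.group <- cut(year.num,
                  breaks = c(1980, 1990, 2000, 2010),
                  labels = c("1981-1990", "1991-2000", "2001-2010"))
print(table(year.group))
```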
DATA CLEANING
• Some datasets contain values such as 997, 998, and 999 that are not actual measurements, and there may also be duplicate records in the data.
• Finding and removing duplicates
  - Data sources often contain duplicate values.
  - It is a good idea to check for duplicates in your data.
  - R provides useful functions for detecting duplicate values, such as duplicated() and unique().
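A small sketch of both cleaning steps. The survey data is hypothetical, and treating 997, 998 and 999 as missing-value codes is an assumption about what those placeholder values mean in a given dataset.

```r
# Hypothetical survey data: row 3 repeats row 2, and 998/999 are placeholders.
survey <- data.frame(id = c(1, 2, 2, 3, 4),
                     income = c(52000, 999, 999, 61000, 998))

# duplicated() flags every row that repeats an earlier row.
dup_rows <- duplicated(survey)

# Keep only the first occurrence of each row (unique(survey) is equivalent).
deduped <- survey[!dup_rows, ]

# Assumption: 997, 998 and 999 encode "no valid value", so recode them as NA.
deduped$income[deduped$income %in% c(997, 998, 999)] <- NA

print(deduped)
```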
THANK YOU !!!
