Dup Spector Data Manipulation With R Springer 2008 PDF

This document provides an overview of a presentation on data munging using R. It discusses what data munging is, summarizing data using tools like exploratory data analysis, visualization, and packages like tapply, data.table, sqldf, and plyr. It then covers examples of summarizing sample shipment data by week, store, and product to demonstrate these techniques. Debugging and improving performance through profiling and optimizing data reads are also addressed.
Data Munging With R

Jim Holtman
Kroger
Data Munger Guru

CinDay R User Group

Topics Covered
What is data munging

Summarizing data with various tools


EDA: exploratory data analysis
Visualization of the data
Measuring performance
Reading in data & Time/Date classes

Debugging

Data Munging
Your desktop dictionary may not include it, but 'munging' is a
common term in the programmer's world. Many computing tasks
require taking data from one computer system, manipulating it in
some way, and passing it to another. Munging can mean
manipulating raw data to achieve a final form. It can mean parsing
or filtering data, or the many steps required for data recognition.

R is an open source software package directed at analyzing and visualizing data, but with the power of the language, and available packages, it also provides a powerful means of slicing/dicing the data to get it into a form for analysis.
Summarizing Data
Various ways of collecting information about relationships of data elements

I am going to use weekly shipments of products to stores

The data is generated, since I cannot use actual (proprietary) information, but the techniques are the same.
52 weeks of deliveries to 12 stores of 4000 products (~2.5M rows of data)
Tools used
tapply: part of base R
data.table: package that is fast for many of these summarization operations; it is one that I am using more and more.
sqldf: package that allows SQL access to dataframes; shortens the learning curve on some R activities if you already know SQL.
plyr: package for slicing/dicing that is used by many users.
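
The script that generated the shipment data is not part of this text. As a minimal sketch of how a data set of the same shape might be built, the code below assumes exponentially distributed shipment counts (rexp) and uses much smaller, hypothetical dimensions; the column names week, store, upc and ship are also assumptions.

```r
# Illustrative dimensions only (hypothetical, scaled down from the talk's
# 52 weeks x 12 stores x 4000 products so the sketch runs instantly).
set.seed(1)
n_weeks <- 4
n_stores <- 3
n_upcs <- 5

# One row per week/store/product combination.
ship_data <- expand.grid(week = seq_len(n_weeks),
                         store = seq_len(n_stores),
                         upc = seq_len(n_upcs))

# Assumed generator: exponentially distributed counts, rounded up to whole items.
ship_data$ship <- ceiling(rexp(nrow(ship_data), rate = 1 / 10))

str(ship_data)
nrow(ship_data)  # n_weeks * n_stores * n_upcs rows
```

With the full dimensions (52, 12 and 4000) the same call to expand.grid would give the roughly 2.5M rows mentioned above.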

?tapply

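tapply(x$Count, x$Key, sum)

[Diagram: the rows of x (columns Key and Count) are split by Key into groups for Key = 1, Key = 2 and Key = 3; sum is then applied to the Count values in each group, giving one total per key.]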
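
The split-then-sum behaviour of tapply can be tried on a few rows. This is a small sketch with made-up values (not the numbers from the diagram); it also spells out the same computation with split and sapply to show what tapply is doing.

```r
# Hypothetical data, not the values from the original diagram.
x <- data.frame(Key = c(1, 2, 1, 3, 2, 3),
                Count = c(10, 5, 20, 7, 15, 3))

# Split Count by Key, then sum each group.
totals <- tapply(x$Count, x$Key, sum)
print(totals)

# The same result spelled out as split + sapply.
totals_manual <- sapply(split(x$Count, x$Key), sum)
print(totals_manual)
```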

?data.table

?sqldf

plyr Package
plyr: Tools for splitting, applying and combining data
plyr is a set of tools that solves a common set of problems: you
need to break a big problem down into manageable pieces,
operate on each piece and then put all the pieces back together.
For example, you might want to fit a model to each spatial
location or time point in your study, summarise data by panels or
collapse high-dimensional arrays to simpler summary statistics.
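
The split/apply/combine pattern described here can be sketched in base R alone, so no package installation is assumed. plyr wraps the same three steps in single calls such as ddply; the data and column names below are hypothetical.

```r
# Hypothetical shipment data for illustration.
shipments <- data.frame(store = c("A", "A", "B", "B", "C"),
                        ship = c(4, 6, 10, 2, 8))

# Split: one data frame per store.
pieces <- split(shipments, shipments$store)

# Apply: summarise each piece independently.
summaries <- lapply(pieces, function(piece) {
  data.frame(store = piece$store[1],
             total = sum(piece$ship),
             mean = mean(piece$ship))
})

# Combine: stack the per-store summaries back into one data frame.
result <- do.call(rbind, summaries)
print(result)
```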

Setup for Script

EDA: Distribution of ship Data


[Figure: Density Plot of "ship" Distribution, comparing the Actual Data with an rexp Distribution and marking the Mean. x-axis: Items Shipped; y-axis: Density.]

How To Determine Shipments Per Week?


What process would you use to create a summary of shipments per week?

Using C++/Java
Using Excel (Pivot Tables?)
Using SQL
Your other favorite language

What approach would you use in R?

You want to work on the objects as a whole.
Think of how you would split/partition the data and then operate on each group.

Total Products Ordered Per Week


Anything interesting about the time it took to execute the various commands? Which one would you want to use?

Notice that all of the commands returned the same values.
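
The commands and timings from this slide were shown as an image and are not in the text. As a rough stand-in, the sketch below times three base R routes to per-week totals (tapply, aggregate and rowsum) on hypothetical, scaled-down data; it does not reproduce the data.table, sqldf or plyr calls from the talk, and the column names are assumptions.

```r
# Hypothetical, scaled-down shipment data (the talk used ~2.5M rows).
set.seed(42)
ship_data <- expand.grid(week = 1:52, store = 1:12, upc = 1:200)
ship_data$ship <- ceiling(rexp(nrow(ship_data), rate = 1 / 10))

# Three base R routes to the same per-week totals, each timed.
t_tapply <- system.time(
  by_tapply <- tapply(ship_data$ship, ship_data$week, sum))
t_aggregate <- system.time(
  by_aggregate <- aggregate(ship ~ week, data = ship_data, FUN = sum))
t_rowsum <- system.time(
  by_rowsum <- rowsum(ship_data$ship, ship_data$week))

# Elapsed seconds for each approach (values depend on the machine).
print(c(tapply = t_tapply[["elapsed"]],
        aggregate = t_aggregate[["elapsed"]],
        rowsum = t_rowsum[["elapsed"]]))

# All three should agree on the totals.
all.equal(as.vector(by_tapply), by_aggregate$ship)
all.equal(as.vector(by_tapply), as.vector(by_rowsum))
```

Checking that the results agree before comparing timings is the point of the exercise: a faster command is only useful if it returns the same values.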

Plot of Shipments Per Week


[Figure: Total Shipments by Week. x-axis: Weeks; y-axis: Cases Per Week, running from about 263000 to 267000.]

Notice the y-axis scaling.

Is there seasonal variation in the data?

Better? Plot of Shipments

[Figure: Total Shipments by Week, replotted with the y-axis starting at zero. x-axis: Weeks; y-axis: Cases Per Week, running from 0 to 250000.]
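
The difference between the two plots comes down to the y-axis limits. A minimal sketch, using hypothetical weekly totals rather than the talk's data, shows the default scaling next to a zero-based axis set with ylim.

```r
# Hypothetical weekly totals that vary only slightly around 265000.
set.seed(7)
weekly <- round(265000 + runif(52, min = -2000, max = 2000))

# Default axis: R fits the y-range to the data, exaggerating small changes.
default_range <- range(weekly)

# Anchoring the y-axis at zero shows the variation in proportion.
zero_based_range <- c(0, max(weekly))

# Draw to a null device so the sketch runs without a display; drop this
# line and dev.off() to see the plots on screen.
pdf(NULL)
plot(weekly, type = "l", xlab = "Weeks", ylab = "Cases Per Week",
     main = "Total Shipments by Week")
plot(weekly, type = "l", ylim = zero_based_range, xlab = "Weeks",
     ylab = "Cases Per Week", main = "Total Shipments by Week")
dev.off()

# Spread of the data relative to its mean: a small fraction.
diff(default_range) / mean(weekly)
```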

Products Per Store Per Week

Use View to Look at Your Data


Brings up a separate window that you can scroll through to see all the information in a dataframe.

Does this data seem reasonable?

Store per Week by UPC (Original Data!)

This is from the original creation of the data and we did get back the same result.
Let's Add Some Extra Information to the Data

In many cases, you may have data from different tables that you want to join (merge) together based on a common key.

In this example, I have a file with the names of the 4000 products that I would like to add to the 2.5M row dataframe that I have that defines the shipments.

In SQL I would do a JOIN; in R I could use the merge function, or I could do it with some of the basic functions.

Functions like merge are nice, but hide what they are doing. It is good to understand what is happening so if necessary, you can improve the performance of your program.

Read in the UPC Name File

Using merge
merge is general purpose and does a lot of checking/validation that can lead to extended execution times.

Using the base functions


Understanding how some of the base functions work can lead to improved performance. The technique of creating a set of indices and then using them is powerful and gets to the heart of R with vectorization of operations. Notice that this is 100X faster than the use of merge and gives the same result.
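
The slide's code is not in the text, so the following is only a sketch of the index technique, assuming match is the base function used to build the indices. The data frames and column names are hypothetical.

```r
# Hypothetical shipment rows and a product-name lookup table.
shipments <- data.frame(upc = c(103, 101, 102, 101, 103),
                        ship = c(5, 8, 2, 7, 1))
upc_names <- data.frame(upc = c(101, 102, 103),
                        name = c("apples", "bread", "cheese"),
                        stringsAsFactors = FALSE)

# Build an index vector: for each shipment row, the matching row of upc_names.
idx <- match(shipments$upc, upc_names$upc)

# Use the indices to pull the names across in one vectorized step.
joined <- shipments
joined$name <- upc_names$name[idx]

# merge() gives the same pairing, though it sorts by the key by default.
merged <- merge(shipments, upc_names, by = "upc")
print(joined)
```

The match approach keeps the shipment rows in their original order and does no validation, which is where the speed difference relative to merge would come from.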

Where Does the Time Go?


Profiling helps to see what is happening.

Of the 32 secs, 18.6 were consumed by the nchar function, which counts the number of characters in a character object. 6.2 secs were in make.unique, which makes character strings unique; this is important when combining dataframes that might have the same names for columns.

As mentioned before, merge is general purpose and does a lot of validation on the data since it is not sure what the caller may be passing in.
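
A profile like this is collected with Rprof and summarised with summaryRprof. The sketch below profiles a merge on hypothetical data; the sizes and the repeat count are arbitrary choices made so that the profiler gets enough samples, and the numbers it reports will not match the slide.

```r
# Hypothetical workload: a merge on made-up data, repeated a few times
# so that the sampling profiler records enough samples.
set.seed(3)
n <- 200000
left <- data.frame(key = sample(5000, n, replace = TRUE), value = runif(n))
right <- data.frame(key = 1:5000, label = paste0("item", 1:5000))

prof_file <- tempfile(fileext = ".out")

Rprof(prof_file)                 # start sampling the call stack
for (i in 1:5) result <- merge(left, right, by = "key")
Rprof(NULL)                      # stop profiling

# by.self is time spent in a function itself;
# by.total also includes time in the functions it calls.
prof <- summaryRprof(prof_file)
head(prof$by.total)
head(prof$by.self)
```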

Another Way of Showing the Rprof Data


C:\jph\CinDay>perl /perf/bin/readRprof.pl Rprof.out
0                    33.0 root
1.                   33.0 system.time
2. .                 32.8 merge
3. . .               32.8 merge.data.frame
4. . . .             21.5 cbind
5. . . . |           21.5 cbind
6. . . . | .         21.5 data.frame
7. . . . | . .       18.6 nchar
7. . . . | . .        0.7 unlist
7. . . . | . .        0.2 data.row.names
8. . . . | . . .      0.2 anyDuplicated
9. . . . | . . . .    0.2 anyDuplicated.default
7. . . . | . .        0.2 anyDuplicated
8. . . . | . . .      0.2 anyDuplicated.default
7. . . . | . .        0.1 list
7. . . . | . .        0.0 any
7. . . . | . .        0.0 attr<-
7. . . . | . .        0.0 is.na
4. . . .             10.1 [
5. . . . |           10.1 [.data.frame
6. . . . | .          7.5 make.unique
7. . . . | . .        1.3 as.character
6. . . . | .          0.5 anyDuplicated
7. . . . | . .        0.5 anyDuplicated.default
6. . . . | .          0.3 sort.list
6. . . . | .          0.1 is.na
6. . . . | .          0.1 vector
7. . . . | . .        0.1 length
8. . . . | . . .      0.1 length
6. . . . | .          0.0 any
6. . . . | .          0.0 c
6. . . . | .          0.0 attr<-
4. . . .              0.4 match
4. . . .              0.1 names<-
4. . . .              0.0 row.names<-
5. . . . |            0.0 row.names<-.data.frame
2. .                  0.3 gc

This shows that most of the time (21.5 secs) is spent in cbind putting together the resulting dataframe. It is in there you can see 18.6 secs being used by nchar.
This shows the calling tree.
The 10.1 secs being used by [ is the accessing of information in a dataframe. This can be costly if you are doing a lot of it. In many cases, depending on the structure of your data, you are better off (performance wise) using a matrix instead of a dataframe.

Hints on Reading in Data


If you don't need factors, use as.is = TRUE in read.table & read.csv to read in as characters.

The same goes when creating data.frames; use stringsAsFactors = FALSE

If your data has quotes, and is not a csv file, you will probably have to have quote = "" as a parameter. If you don't, you will probably see fewer lines read than what you thought you had in your file.

If your data has # as part of data, use comment.char = "".

If your data lines do not all have the same number of fields, you may have to understand what the fill and flush parameters do.

read.table tries to determine what type each field is, but it is best to use colClasses to explicitly define the type of each field.
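
Several of these hints can be combined in one call. The sketch below reads a small, made-up table (passed inline through the text argument instead of a file) whose fields contain an apostrophe and a # character; the column names and values are hypothetical.

```r
# Hypothetical file contents, supplied inline via text= for the sketch.
raw <- "id name comment
1 O'Brien item#1
2 Smith item#2"

parts <- read.table(text = raw,
                    header = TRUE,
                    quote = "",          # treat quote characters as ordinary data
                    comment.char = "",   # treat # as data, not a comment
                    colClasses = c("integer", "character", "character"))
str(parts)
```

Without quote = "" the apostrophe in the first data line would open a quoted string, and without comment.char = "" everything from # onward would be dropped.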

Sample Performance Data From UNIX


Blank separated fields from a vmstat command executed every 30 seconds during the day.

date time r b w swap free re mf pi po fr de sr intr syscalls cs user sys id
07/27/05 00:13:06 0 0 0 27755440 13051648 20 86 0 0 0 0 0 456 2918 1323 0 1 99
07/27/05 00:13:36 0 0 0 27755280 13051480 11 53 0 0 0 0 0 399 1722 1411 0 1 99
07/27/05 00:14:06 0 0 0 27753952 13051248 18 88 0 0 0 0 0 424 1259 1254 0 1 99
07/27/05 00:14:36 0 0 0 27755304 13051496 17 85 0 0 0 0 0 430 1029 1246 0 1 99
07/27/05 00:15:06 0 0 0 27755064 13051232 41 278 0 1 1 0 0 452 2047 1386 0 1 99
07/27/05 00:15:36 0 0 0 27753824 13040720 125 1039 0 0 0 0 0 664 4097 1901 3 2 95
07/27/05 00:16:06 0 0 0 27754472 13027000 15 91 0 0 0 0 0 432 1160 1273 0 1 99
07/27/05 00:16:36 0 0 0 27754568 13027104 17 85 0 0 0 0 0 416 1058 1271 0 1 99
07/27/05 00:17:06 0 0 0 27754560 13027096 13 69 0 0 0 0 0 425 1198 1268 0 1 99
07/27/05 00:17:36 0 0 0 27754704 13027240 12 51 0 1 1 0 0 432 1727 1477 0 1 99
07/27/05 00:18:06 0 0 0 27755096 13027592 27 120 0 0 0 0 0 426 1449 1302 0 1 99
07/27/05 00:18:36 0 0 0 27755168 13027664 16 76 0 0 0 0 0 420 1002 1278 0 1 99
07/27/05 00:19:06 0 0 0 27755096 13027584 14 86 0 0 0 0 0 410 1224 1263 0 1 99
07/27/05 00:19:36 0 0 0 27755344 13027832 7 26 0 0 0 0 0 409 1606 1445 0 1 99
07/27/05 00:20:06 0 0 0 27755168 13027624 56 337 0 1 1 0 0 438 2112 1406 0 1 98
07/27/05 00:20:36 0 0 0 27755496 13027872 16 77 0 0 0 0 0 418 1045 1259 0 1 99
07/27/05 00:21:06 0 0 0 27755648 13028016 14 88 0 0 0 0 0 410 1264 1254 0 1 99
07/27/05 00:21:36 0 0 0 27755712 13028088 8 34 0 0 0 0 0 418 1666 1427 0 1 99
07/27/05 00:22:06 0 0 0 27755816 13028192 14 76 0 0 0 0 0 443 1246 1295 0 1 99
07/27/05 00:22:36 0 0 0 27755816 13028184 19 85 0 1 1 0 0 422 1084 1277 0 1 99

Time Classes
Some of your data will probably have some columns with time/date that you will have to handle.

Need to convert from a character string into some time/date class

There are operations you can perform on dates: differences between them, when is a start of a month/quarter/year, plotting/summarizing by date, etc.

There are several different classes that can be used, but the two most prevalent ones are POSIX (POSIXct/POSIXlt) and Date

See R News 4/1 (June 2004) for a good discussion on the subject.
Using dates has a learning curve; the above reference helps.

Times and dates are typically read in as character strings and then converted to the appropriate date class

I use POSIXct for almost all my date related values

This is based on 1/1/1970 as the epoch, which is the same as UNIX/Linux uses and makes the transfer of data between systems easier.
Read In and Convert the Time
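
The code for this slide was an image and is not in the text. A minimal sketch of one way to do the conversion follows, using the first two sample lines of the vmstat data (first four columns only); the format string is inferred from the sample data, the UTC time zone is an assumption, and the POSIX column name is taken from the plotting code in this talk.

```r
# Two sample lines in the same layout as the vmstat data (first four columns only).
raw <- "date time r b
07/27/05 00:13:06 0 0
07/27/05 00:13:36 0 0"

# as.is = TRUE keeps date and time as character strings.
VMstat <- read.table(text = raw, header = TRUE, as.is = TRUE)

# Combine date and time, then parse; the time zone is an assumption.
VMstat$POSIX <- as.POSIXct(paste(VMstat$date, VMstat$time),
                           format = "%m/%d/%y %H:%M:%S", tz = "UTC")

# POSIXct is seconds since 1970-01-01, so differences are simple arithmetic.
diff(VMstat$POSIX)
```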

Plot user + sys Over Time

plot(VMstat$POSIX, VMstat$user + VMstat$sys, type='l')
lines(VMstat$POSIX, VMstat$sys, col='red')
abline(h=mean(VMstat$user + VMstat$sys), col='green', lwd=3)

[Figure: line plot of VMstat$user + VMstat$sys (y-axis, 0 to 80) against VMstat$POSIX (x-axis, 02:00 to 22:00).]

Boxplots
Many organizations like to summarize the utilization over some time period. I am going to assume that we would like to see statistics for each one hour period during the day.

One technique that is used is to create a box and whiskers chart of the data. The box contains 50% of the data points (between the 25th and 75th percentiles). The line in the box is the median value.

The whiskers extend above/below the box to the last data point that lies within 1.5X the size of the box.

Any data points lying outside the whiskers are plotted as individual points.
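
The numbers behind a box plot can be inspected with boxplot.stats, which applies the same 1.5X rule by default. The values below are hypothetical utilization figures chosen so that one point falls outside the whiskers.

```r
# Hypothetical utilization values with one obvious outlier.
util <- c(2, 3, 3, 4, 5, 5, 6, 7, 8, 40)

stats <- boxplot.stats(util)   # uses the default coef = 1.5

# Lower whisker, lower hinge, median, upper hinge, upper whisker.
stats$stats

# Points beyond the whiskers, which boxplot() draws individually.
stats$out
```

Here the box runs from 3 to 7, so the whiskers may reach at most 6 units beyond it; 8 is the last point inside that limit and 40 is reported as an outlier.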

boxplot Showing Utilization in Each Hour

VMstat$hour <- as.integer(format(VMstat$POSIX, format = "%H"))
boxplot(user + sys ~ hour, data=VMstat, ylab="Utilization", xlab="Time of Day")

[Figure: box plots of Utilization (y-axis, 0 to 80), one box per hour along Time of Day (x-axis).]

String Handling/Regular Expressions


Until recently, the only two languages I needed (out of the over 100 I have written programs in) were R and Perl: Perl to prepare the data for R, and R to analyze the data.

R currently has most of the regular expression capabilities of Perl, and I have had to revert to Perl less and less since I can do most of my processing in R.

So with the 4,000 product descriptions that we have, let's count up the number of times each word occurs and print the 20 most frequently appearing.

Let's then select one, and list out all that contain that word.
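
The slides with the actual code were images, so this is only a sketch of the two steps using strsplit, table and grep. The product descriptions are hypothetical stand-ins for the 4,000 real ones.

```r
# Hypothetical product descriptions standing in for the 4,000 real ones.
descriptions <- c("ORGANIC WHOLE MILK", "WHOLE WHEAT BREAD",
                  "CHOCOLATE MILK", "ORGANIC WHEAT CRACKERS", "SKIM MILK")

# Split each description on runs of whitespace and pool all the words.
words <- unlist(strsplit(descriptions, "[[:space:]]+"))

# Count occurrences and keep the most frequent (here at most 20).
counts <- sort(table(words), decreasing = TRUE)
head(counts, 20)

# List every description containing a chosen word;
# \\b marks a word boundary in a Perl-style regular expression.
milk_items <- grep("\\bMILK\\b", descriptions, value = TRUE, perl = TRUE)
print(milk_items)
```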

Debugging
All programs have bugs.

When the error occurs, you need to see the environment in which it happened

May be deep in a series of function calls
Need to go up through each level to see what the parameters were
Need to examine the objects in each function environment

One way of trapping the error and gaining control is to put the following function call in your script; I have it as part of my Startup so that it is always active:
options(error = utils::recover)
On an error it will give you the stack trace and let you set the browser at the appropriate environment to examine values.

Also check out the debug package.
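
recover only helps in an interactive session. For scripts run in batch, a related technique (not the one from the slide) is to record the call stack at the moment of the error with withCallingHandlers and sys.calls. The function names below are hypothetical.

```r
# Hypothetical nested functions; the innermost one fails.
inner <- function(x) log(x) + undefined_value
middle <- function(x) inner(x * 2)
top_level <- function(x) middle(x + 1)

# In an interactive session you would instead set
#   options(error = utils::recover)
# and call top_level(1); recover then lists the frames to browse.

# Non-interactive alternative: record the call stack when the error is raised.
captured <- NULL
result <- tryCatch(
  withCallingHandlers(
    top_level(1),
    error = function(e) captured <<- sys.calls()
  ),
  error = function(e) conditionMessage(e)
)

print(result)     # the error message
print(captured)   # the calls that were active when the error occurred
```

The captured stack includes the calls to top_level, middle and inner, which is the same chain recover would let you step through.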


Example of Processing Error

[Screenshot: an R session after an error, annotated in order: error message; calling stack; go to stack frame 2; get list of objects in frame; examine value of x.]

FAQ 7.31
In the R-Help news group, this is referred to a lot: Why doesn't R think these numbers are equal?

What Every Computer Scientist Should Know About Floating-Point Arithmetic, ACM Computing Surveys, 23/1, 5-48, also available via
http://www.validlab.com/goldberg/paper.pdf.
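
A short illustration of the issue behind this FAQ entry, using the familiar 0.1 + 0.2 case, together with the tolerance-based comparison that all.equal provides:

```r
a <- 0.1 + 0.2
b <- 0.3

a == b                      # exact comparison of doubles
isTRUE(all.equal(a, b))     # comparison within a small tolerance
abs(a - b) < 1e-9           # the same idea with an explicit tolerance

print(a, digits = 17)       # shows the stored value is not exactly 0.3
```

The exact comparison fails because neither 0.1 nor 0.2 has an exact binary representation, so their sum differs from the stored value of 0.3 in the last bits.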
Subset of R Functions to Start With


abline
abs
all
all.equal
any
apply
approx
approxfun
arrows
as.integer
as.numeric
as.POSIXct
assign
attr
axis
barplot
boxplot
break
c
cat
cbind
ceiling
character
colMeans
colSums
count.fields
cummax
cummin
cumprod
cumsum
curve

cut
data.frame
density
deparse
dev.off
diff
dim
do.call
duplicated
eval
exists
factor
floor
flush.console
for
function
gc
get
grep
help.search
hist
if
ifelse
image
integer
jitter
lapply
layout
layout.show
length
levelplot

levels
lines
list
lm
load
ls
match
matplot
matrix
max
mean
median
min
mtext
names
nchar
ncol
next
nrow
numeric
options
order
pairs
palette
par
parse
paste
pdf
plot
postscript
print

quantile
quit
range
rbind
read.csv
read.table
regexpr
rep
return
rle
rm
row
rowMeans
rownames
rowSums
Rprof
rug
sample
sapply
save
save.image
scan
seq
set.seed
setwd
sink
sort
source
split
sprintf
str

strftime
strptime
strsplit
structure
substr
sum
summary
supsmu
table
tapply
terms
text
title
traceback
trunc
trunc.POSIXt
truncate
try
unclass
unique
unlist
which
which.max
which.min
while
window
with
write.csv
