Dup Spector Data Manipulation With R Springer 2008 PDF
Jim Holtman
Kroger
Data Munger Guru
Topics Covered
What is data munging
Debugging
Data Munging
Your desktop dictionary may not include it, but 'munging' is a
common term in the programmer's world. Many computing tasks
require taking data from one computer system, manipulating it in
some way, and passing it to another. Munging can mean
manipulating raw data to achieve a final form. It can mean parsing
or filtering data, or the many steps required for data recognition.
Summarizing Data
Various ways of collecting information about relationships of data
elements
?tapply
[Figure: a Count column is split by Key (Key = 1, Key = 2, Key = 3) and each group is summed.]
?data.table
?sqldf
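The split-by-key-then-sum idea can be sketched with tapply. The data frame below is hypothetical (the column names Key and Count echo the slide, the values are made up for illustration), so treat it as a minimal sketch and not the talk's own example.

```r
# Hypothetical data: one count per record, plus the key it belongs to.
shipments <- data.frame(
  Key   = c(1, 2, 1, 3, 2, 3),
  Count = c(10, 23, 5, 3, 7, 4)
)

# Split Count by Key and sum each group in a single call.
totals <- tapply(shipments$Count, shipments$Key, sum)
print(totals)

# The same result spelled out as two steps: split, then apply.
pieces  <- split(shipments$Count, shipments$Key)
totals2 <- sapply(pieces, sum)
```

tapply returns one value per distinct key, named by the key, which makes it convenient for quick summaries before reaching for data.table or sqldf.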
plyr Package
plyr: Tools for splitting, applying and combining data
plyr is a set of tools that solves a common set of problems: you
need to break a big problem down into manageable pieces,
operate on each piece and then put all the pieces back together.
For example, you might want to fit a model to each spatial
location or time point in your study, summarise data by panels or
collapse high-dimensional arrays to simpler summary statistics.
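The split, apply, combine pattern described above can be sketched in base R. This is not plyr itself: it uses split, lapply and do.call(rbind, ...) to show the three steps that plyr functions such as ddply bundle together. The data set, the column names (site, x, y) and the model are assumptions made up for illustration.

```r
# Hypothetical study data: three sites, a predictor x and a response y.
set.seed(1)
study <- data.frame(
  site = rep(c("A", "B", "C"), each = 20),
  x    = rep(1:20, times = 3)
)
study$y <- 2 * study$x + rnorm(nrow(study))

# Split: one data frame per site.
pieces <- split(study, study$site)

# Apply: fit a linear model to each piece and keep its coefficients.
fits <- lapply(pieces, function(d) coef(lm(y ~ x, data = d)))

# Combine: stack the per-site coefficients into one matrix.
result <- do.call(rbind, fits)
print(result)
```

Each row of result holds the intercept and slope for one site, so the "big problem" (a model per location) has been reduced to one small function applied to each piece.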
[Figure: density plot of Items Shipped, with the mean marked.]
Using C++/Java
Using Excel (Pivot Tables?)
Using SQL
Your other favorite language
Is there seasonal variation in the data?
[Figure: weekly values plotted against Weeks (x-axis 10 to 50, y-axis roughly 263000 to 266000).]
[Figure: plot against Weeks (x-axis 10 to 50) with the y-axis running from 0 to 250000.]
In this example, I have a file with the names of 4000 products
that I would like to add to a 2.5M-row data frame that
defines the shipments.
Functions like merge are nice, but hide what they are doing. It
Using merge
merge is general purpose and does a lot of checking/validation
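To make the product-name lookup concrete, here is a small sketch that attaches names to shipments twice: once with merge, and once with match, which exposes the lookup step that merge hides. The tables, column names (product_id, name, qty) and values are hypothetical stand-ins for the 4000-product file and the 2.5M-row shipment data.

```r
# Hypothetical lookup table (stands in for the product-name file).
products <- data.frame(
  product_id = c(101, 102, 103),
  name       = c("Milk", "Bread", "Eggs"),
  stringsAsFactors = FALSE
)

# Hypothetical shipments (stands in for the large data frame).
shipments <- data.frame(
  product_id = c(103, 101, 101, 102, 103),
  qty        = c(5, 2, 7, 1, 4)
)

# General-purpose route: merge on the shared key.
# Note that merge sorts the result by the key by default.
merged <- merge(shipments, products, by = "product_id", all.x = TRUE)

# Explicit route: find each shipment's row in the lookup table,
# then index the name column with those positions.
idx <- match(shipments$product_id, products$product_id)
shipments$name <- products$name[idx]
```

The match version keeps the original row order and does only the one lookup, which is why it is often worth considering when the left-hand table is very large.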
If your data has quotes, and is not a csv file, you will probably have
If your data lines do not all have the same number of fields, you may
read.table tries to determine what type each field is, but it is best to
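The three read.table points above are cut off in this copy, so the following is only a sketch of read.table arguments that address those situations; the choice of quote, fill and colClasses is an assumption about what the slides recommended. The input text and column names are made up.

```r
# Hypothetical whitespace-separated input with an apostrophe in a name
# and one line that is missing its last field.
raw <- c("id name qty",
         "1 O'Brien 10",
         "2 Smith",
         "3 Jones 7")

dat <- read.table(
  text       = raw,
  header     = TRUE,
  quote      = "",      # treat ' and " as ordinary characters
  fill       = TRUE,    # pad short lines instead of stopping with an error
  colClasses = c("integer", "character", "integer")  # state the types up front
)
print(dat)
```

Stating colClasses explicitly avoids type guessing and usually speeds up reading large files as well.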
Time Classes
Some of your data will probably have some columns with time/date
There are several different classes that can be used, but the two
See R News 4/1, June 2004, for a good discussion on the subject.
Using dates has a learning curve; the above reference helps.
Times and dates are typically read in as character strings and then
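A minimal sketch of converting character timestamps after reading them in. It assumes the POSIXct and Date classes (the slide's own list of classes is cut off in this copy), and the timestamp strings and their format are made up for illustration.

```r
# Hypothetical timestamps as they might arrive from a file: plain strings.
stamps <- c("2008-03-01 02:00:00",
            "2008-03-01 07:00:00",
            "2008-03-01 12:00:00")

# Parse with an explicit format and time zone, then store as POSIXct.
when <- as.POSIXct(strptime(stamps, format = "%Y-%m-%d %H:%M:%S", tz = "UTC"))

# Once converted, arithmetic and formatting work on real times.
gap_hours   <- as.numeric(difftime(when[2], when[1], units = "hours"))
hour_of_day <- as.integer(format(when, "%H"))
day         <- as.Date(when)
```

Giving tz explicitly keeps the result independent of the machine's local time zone, which is a common source of surprises with date-time data.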
[Figure: VMstat$user + VMstat$sys plotted against VMstat$POSIX (time of day, 02:00 to 22:00).]
Boxplots
Many organizations like to summarize the utilization on some time
of the data. The box contains 50% of the data points (between the
25th and 75th percentiles). The line in the box is the median value.
The whiskers extend above/below the box to the last data point or a
Any data points lying outside the whiskers are plotted as individual
points.
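The boxplot description can be checked against the numbers R actually computes. The sketch below uses simulated utilization values (an assumption, not the talk's VMstat data) grouped by hour, and asks boxplot for its statistics without drawing. With R's defaults the whiskers reach the most extreme data point within 1.5 times the interquartile range of the box.

```r
# Hypothetical utilization samples: 30 readings for each hour from 10 to 23.
set.seed(42)
hours <- rep(10:23, each = 30)
usage <- data.frame(
  hour = hours,
  util = pmin(100, pmax(0, rnorm(length(hours), mean = 40, sd = 15)))
)

# Compute the boxplot statistics per hour without plotting.
bp <- boxplot(util ~ hour, data = usage, plot = FALSE)

# bp$stats has one column per hour and five rows:
# lower whisker, lower hinge (about the 25th percentile), median,
# upper hinge (about the 75th percentile), upper whisker.
print(round(bp$stats, 1))

# Points beyond the whiskers are listed in bp$out with their group in bp$group.
# To draw the picture itself:
# boxplot(util ~ hour, data = usage, xlab = "Time of Day", ylab = "Utilization")
```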
[Figure: boxplots of Utilization by Time of Day (hours 10 to 23).]
have written programs in) were R and Perl: Perl to prepare the data
for R, and R to analyze the data.
and I have had to revert to Perl less and less since I can do most of
my processing in R.
the number of times each word occurs and prints the 20 most
frequently appearing.
Let's then select one, and list out all that contain that word.
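The word-count task described above can be sketched in a few lines of R. The original script is not shown in this copy, so this is one possible version using strsplit, table, sort and grep; the input lines and the chosen word are made up, and listing matching lines is an assumption about what "all that contain that word" refers to.

```r
# Hypothetical input text, one element per line.
text_lines <- c(
  "R makes data munging easy",
  "Perl prepares the data and R analyzes the data",
  "most of the processing is done in R"
)

# Break lines into lower-case words, dropping empty strings.
words <- tolower(unlist(strsplit(text_lines, "[^A-Za-z]+")))
words <- words[nchar(words) > 0]

# Count occurrences and keep the 20 most frequent words.
counts <- sort(table(words), decreasing = TRUE)
top    <- head(counts, 20)
print(top)

# Select one word and list every line that contains it as a whole word.
chosen <- "data"
hits <- grep(paste0("\\b", chosen, "\\b"), text_lines,
             ignore.case = TRUE, perl = TRUE, value = TRUE)
print(hits)
```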
Debugging
All programs have bugs.
which it happened
One way of trapping the error and gaining control is to put the
error message
Calling stacks
go to stack frame 2
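The debugging notes above are fragmentary in this copy, so the following is only a sketch of standard base R tools for trapping an error and finding where it happened. The functions worker and wrapper are hypothetical.

```r
# Hypothetical functions: wrapper calls worker, and worker fails on bad input.
worker <- function(x) {
  if (!is.numeric(x)) stop("x must be numeric")
  sqrt(x)
}
wrapper <- function(x) worker(x)

# try() traps the error so the script keeps running.
result <- try(wrapper("a"), silent = TRUE)
failed <- inherits(result, "try-error")
msg    <- conditionMessage(attr(result, "condition"))

# tryCatch() gives a handler access to the message and the failing call.
outcome <- tryCatch(
  wrapper("a"),
  error = function(e) {
    list(message = conditionMessage(e), call = deparse(conditionCall(e)))
  }
)
print(outcome)

# In an interactive session:
#   traceback()              # after an error, prints the calling stack
#   options(error = recover) # on error, offers a menu of stack frames to inspect
```

With options(error = recover) set, R stops at the error and lets you pick a frame by number, then inspect its variables as if you were inside that function.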
FAQ 7.31
In the R-Help news group, this is referred to a lot: Why doesn't R
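R FAQ 7.31 deals with numbers that look equal when printed but fail an exact comparison, a consequence of binary floating-point representation. A minimal sketch of the effect and of tolerance-based comparison follows; the tolerance 1e-9 is an arbitrary choice for illustration.

```r
a <- 0.1 + 0.2
b <- 0.3

# Printed with default precision, the two values look identical.
print(c(a, b))

# Exact comparison fails because neither value is stored exactly in binary.
exact <- a == b

# Compare with a tolerance instead.
close_enough <- isTRUE(all.equal(a, b))
within_tol   <- abs(a - b) < 1e-9

# Showing more digits reveals the difference.
print(sprintf("%.17f", c(a, b)))
```

all.equal returns either TRUE or a description of the difference, which is why it is wrapped in isTRUE before being used in an if statement.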
cut
data.frame
density
deparse
dev.off
diff
dim
do.call
duplicated
eval
exists
factor
floor
flush.console
for
function
gc
get
grep
help.search
hist
if
ifelse
image
integer
jitter
lapply
layout
layout.show
length
levelplot
levels
lines
list
lm
load
ls
match
matplot
matrix
max
mean
median
min
mtext
names
nchar
ncol
next
nrow
numeric
options
order
pairs
palette
par
parse
paste
pdf
plot
postscript
print
quantile
quit
range
rbind
read.csv
read.table
regexpr
rep
return
rle
rm
row
rowMeans
rownames
rowSums
Rprof
rug
sample
sapply
save
save.image
scan
seq
set.seed
setwd
sink
sort
source
split
sprintf
str
strftime
strptime
strsplit
structure
substr
sum
summary
supsmu
table
tapply
terms
text
title
traceback
trunc
trunc.POSIXt
truncate
try
unclass
unique
unlist
which
which.max
which.min
while
window
with
write.csv