5.1 Creating Data Frames
> kids <- c("Jack","Jill")
> ages <- c(12,10)
> d <- data.frame(kids,ages,stringsAsFactors=FALSE)
> d  # matrix-like viewpoint
  kids ages
1 Jack   12
2 Jill   10
The first two arguments in the call to data.frame() are clear: We wish to
produce a data frame from our two vectors, kids and ages. However, that
third argument, stringsAsFactors=FALSE, requires more comment.
If the named argument stringsAsFactors is not specified, then by default,
stringsAsFactors will be TRUE. (You can also use options() to arrange the
opposite default.) This means that if we create a data frame from a character
vector (in this case, kids), R will convert that vector to a factor. Note that
R 4.0.0 changed this default to FALSE, so the behavior described here applies
to earlier versions. Because our work with character data will typically be
with vectors rather than factors, we'll set stringsAsFactors to FALSE. We'll
cover factors in Chapter 6.
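As a small sketch of the difference, we can build the same data frame both ways and compare the class of the kids column. The names d_chr and d_fac are illustrative and do not appear in the text; stringsAsFactors is set explicitly in both calls so that the result does not depend on the default.

```r
kids <- c("Jack", "Jill")
ages <- c(12, 10)

# Character column kept as a character vector
d_chr <- data.frame(kids, ages, stringsAsFactors = FALSE)

# Character column converted to a factor
d_fac <- data.frame(kids, ages, stringsAsFactors = TRUE)

class(d_chr$kids)   # "character"
class(d_fac$kids)   # "factor"
levels(d_fac$kids)  # "Jack" "Jill"
```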
> d[[1]]
[1] "Jack" "Jill"
> d$kids
[1] "Jack" "Jill"
> d[,1]
[1] "Jack" "Jill"
This matrix-like quality is also seen when we take d apart using str():
> str(d)
'data.frame':	2 obs. of  2 variables:
 $ kids: chr  "Jack" "Jill"
 $ ages: num  12 10
Consider three ways to access the first column of our data frame above:
d[[1]], d[,1], and d$kids. Of these, the third would generally be considered
to be clearer and, more importantly, safer than the first two. This better
identifies the column and makes it less likely that you will reference the
wrong column. But in writing general code (say, writing R packages),
matrix-like notation d[,1] is needed, and it is especially handy if you are
extracting subdata frames (as you'll see when we talk about extracting subdata
frames in Section 5.2).
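The following sketch checks that the three forms return the same vector and shows why index notation suits general code. The helper nth_col() is hypothetical, introduced only for illustration: it cannot know column names in advance, so it relies on a numeric index.

```r
kids <- c("Jack", "Jill")
ages <- c(12, 10)
d <- data.frame(kids, ages, stringsAsFactors = FALSE)

# All three forms extract the same character vector
identical(d[[1]], d$kids)  # TRUE
identical(d[, 1], d$kids)  # TRUE

# Hypothetical general-purpose helper: works on any data frame,
# whatever its columns happen to be called
nth_col <- function(df, n) df[, n]
nth_col(d, 2)  # 12 10
```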
> head(examsquiz)
  Exam.1 Exam.2 Quiz
1    2.0    3.3  4.0
2    3.3    2.0  3.7
3    4.0    4.0  4.0
4    2.3    0.0  3.3
5    2.3    1.0  3.3
6    3.3    3.7  4.0
> examsquiz[2:5,2]
[1] 2 4 0 1
> class(examsquiz[2:5,2])
[1] "numeric"
> examsquiz[2:5,2,drop=FALSE]
  Exam.2
2      2
3      4
4      0
5      1
> class(examsquiz[2:5,2,drop=FALSE])
[1] "data.frame"
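To try the effect of drop=FALSE without the original data file, here is a minimal sketch that rebuilds only the first six rows of examsquiz shown above. The remaining rows of the real data set are not reproduced.

```r
# First six rows only, copied from the head() output
examsquiz <- data.frame(
  Exam.1 = c(2.0, 3.3, 4.0, 2.3, 2.3, 3.3),
  Exam.2 = c(3.3, 2.0, 4.0, 0.0, 1.0, 3.7),
  Quiz   = c(4.0, 3.7, 4.0, 3.3, 3.3, 4.0)
)

v   <- examsquiz[2:5, 2]                # single column collapses to a vector
sub <- examsquiz[2:5, 2, drop = FALSE]  # stays a one-column data frame

class(v)    # "numeric"
class(sub)  # "data.frame"
dim(sub)    # 4 1
```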
> examsquiz[examsquiz$Exam.1 >= 3.8,]
   Exam.1 Exam.2 Quiz
3       4    4.0  4.0
9       4    3.3  4.0
11      4    4.0  4.0
14      4    0.0  4.0
16      4    3.7  4.0
19      4    4.0  4.0
22      4    4.0  4.0
25      4    4.0  3.3
29      4    3.0  3.7

2.0 NA 4.0
In some cases, we may wish to rid our data frame of any observation
that has at least one NA value. A handy function for this purpose is
complete.cases().
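As a brief sketch of how complete.cases() can be used, consider a small made-up data frame with missing values. The name d4 and its contents are assumptions for illustration only.

```r
# Made-up data with one NA in each column
d4 <- data.frame(
  kids   = c("Jack", NA, "Jillian", "John"),
  states = c("CA", "MA", "MA", NA),
  stringsAsFactors = FALSE
)

# TRUE for rows that contain no NA at all
complete.cases(d4)  # TRUE FALSE TRUE FALSE

# Keep only the complete rows
d5 <- d4[complete.cases(d4), ]
d5$kids  # "Jack" "Jillian"
```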
> d
  kids ages
1 Jack   12
2 Jill   10
You can also create new columns from old ones. For instance, we can
add a variable that is the difference between exams 1 and 2:
3    4.0    4.0  4.0      0.0
4    2.3    0.0  3.3     -2.3
5    2.3    1.0  3.3     -1.3
6    3.3    3.7  4.0      0.4
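One way to produce a column like the last one above is to assign the difference to a new named component. This is a sketch on the first six rows only, and the column name ExamDiff is an assumption.

```r
# First six rows of the exam data, as printed earlier
examsquiz <- data.frame(
  Exam.1 = c(2.0, 3.3, 4.0, 2.3, 2.3, 3.3),
  Exam.2 = c(3.3, 2.0, 4.0, 0.0, 1.0, 3.7),
  Quiz   = c(4.0, 3.7, 4.0, 3.3, 3.3, 4.0)
)

# New column computed elementwise from two existing columns
examsquiz$ExamDiff <- examsquiz$Exam.2 - examsquiz$Exam.1

names(examsquiz)  # "Exam.1" "Exam.2" "Quiz" "ExamDiff"
```

Because a data frame is a list of equal-length vectors, assigning to a new name with $ simply appends another component.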
> d
  kids ages
1 Jack   12
2 Jill   10
> d$one <- 1
> d
  kids ages one
1 Jack   12   1
2 Jill   10   1
5.2.5 Extended Example: A Salary Study
In a study of engineers and programmers, I considered the question, "How
many of these workers are the best and the brightest, that is, people of
extraordinary ability?" (Some of the details have been changed here.)
The government data I had available was limited. One (admittedly
imperfect) way to determine whether a worker is of extraordinary ability is
to look at the ratio of actual salary to the government prevailing wage for
that job and location. If that ratio is substantially higher than 1.0, you can
reasonably assume that this worker has a high level of talent.
I used R to prepare and analyze the data and will present excerpts of my
preparation code here. First, I read in the data file:
These operations are typical data cleaning. Most large data sets contain
[...] any analysis.
I also needed to create a new column for the ratio between actual wage
and prevailing wage:
medrat <- function(dataframe) {
   return(median(dataframe$rat,na.rm=TRUE))
}
data sets.
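To see how the ratio column and medrat() fit together, here is a self-contained sketch on made-up numbers. The data frame toy and the column names Wage and Prevailing_Wage are hypothetical; the actual column names in the government data are not shown in this excerpt.

```r
# Made-up salaries; one actual wage is missing
toy <- data.frame(
  Wage            = c(60000, 90000, NA, 120000),
  Prevailing_Wage = c(50000, 60000, 70000, 80000)
)

# Ratio of actual wage to prevailing wage, one value per worker
toy$rat <- toy$Wage / toy$Prevailing_Wage

# Median ratio, skipping rows where the ratio is NA
medrat <- function(dataframe) {
  median(dataframe$rat, na.rm = TRUE)
}

medrat(toy)  # 1.5
```

Without na.rm=TRUE, the single missing wage would make the median NA.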
I was particularly interested in three occupations and thus extracted
subdata frames for them to make their analyses more convenient:
se2006 <- all2006[grep("Software Engineer",all2006),]
prg2006 <- all2006[grep("Programmer",all2006),]
ee2006 <- all2006[grep("Electronics Engineer",all2006),]
Here, I used R's grep() function to identify the rows containing the given
job title. Details on this function are in Chapter 11.
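Since grep() searches a character vector and returns the indices of the matching elements, row selection of this kind is usually written against the specific column that holds the job title. The sketch below assumes a hypothetical column named Job_Title and uses made-up rows; the real column name is not given in this excerpt.

```r
# Made-up records; Job_Title is a hypothetical column name
jobs <- data.frame(
  Job_Title = c("Software Engineer II", "Programmer Analyst",
                "Electronics Engineer", "Senior Software Engineer"),
  rat       = c(1.3, 1.0, 1.1, 1.6),
  stringsAsFactors = FALSE
)

# Indices of the rows whose title contains the pattern
idx <- grep("Software Engineer", jobs$Job_Title)
idx  # 1 4

# Use those indices to extract the subdata frame
se <- jobs[idx, ]
nrow(se)  # 2
```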
Another aspect of interest was analysis by firm. I wrote this function to
extract the subdata frame for a given firm:
makecorp <- function(corpname) {
   t <- all2006[all2006$Employer_Name == corpname,]
   return(t)
}
I then created subdata frames for a number of firms (only some are
shown here).
for (i in 1:(length(corplist)/2)) {
   corp <- corplist[2*i-1]
   newdtf <- paste(corplist[2*i],"2006",sep="")
   assign(newdtf,makecorp(corp),pos=.GlobalEnv)
}
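The loop relies on corplist alternating between a full firm name and a short abbreviation. As an alternative to assign(), which creates global variables, the same pairing can fill a named list. The sketch below swaps in that list-based approach and uses made-up firm names and data; it is not the code used in the study.

```r
# Made-up stand-in for the full data set
all2006 <- data.frame(
  Employer_Name = c("ACME CORP", "GLOBEX INC", "ACME CORP"),
  rat           = c(1.1, 1.4, 0.9),
  stringsAsFactors = FALSE
)

makecorp <- function(corpname) {
  all2006[all2006$Employer_Name == corpname, ]
}

# Alternating (full name, abbreviation) pairs; names are hypothetical
corplist <- c("ACME CORP", "acme", "GLOBEX INC", "globex")

# Store each subdata frame under a key such as "acme2006"
corps <- list()
for (i in seq(1, length(corplist), by = 2)) {
  key <- paste(corplist[i + 1], "2006", sep = "")
  corps[[key]] <- makecorp(corplist[i])
}

names(corps)          # "acme2006" "globex2006"
nrow(corps$acme2006)  # 2
```

A list keeps the results together, so they can later be processed in one pass, for example with lapply().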
merge(x, y)
Here, the two data frames have the variable kids in common. R found
the rows in which this variable had the same value of kids in both data frames
(the ones for Jack and Jill). It then created a data frame with corresponding
rows and with columns taken from both data frames (kids, states, and
ages).
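The merge described here can be reproduced with the small data frames d1 and d2 that appear in this section. This is a sketch, with stringsAsFactors set explicitly so the kids column stays a character vector in any R version.

```r
d1 <- data.frame(
  kids   = c("Jack", "Jill", "Jillian", "John"),
  states = c("CA", "MA", "MA", "HI"),
  stringsAsFactors = FALSE
)
d2 <- data.frame(
  ages = c(12, 10, 7),
  kids = c("Jack", "Jill", "Lillian"),
  stringsAsFactors = FALSE
)

# Only rows whose kids value occurs in both data frames are kept
d <- merge(d1, d2)
names(d)  # "kids" "states" "ages"
d$kids    # "Jack" "Jill"
```

Jillian, John, and Lillian are dropped because each appears in only one of the two data frames.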
The merge() function has named arguments by.x and by.y, which handle
cases in which variables have similar information but different names in the
two data frames. Here's an example:
> d3
  ages    pals
1   12    Jack
2   10    Jill
3    7 Lillian
> merge(d1,d3,by.x="kids",by.y="pals")
  kids states ages
1 Jack     CA   12
2 Jill     MA   10
Even though our variable was called kids in one data frame and pals in
the other, it was meant to store the same information, and thus the merge
made sense.
Duplicate matches will appear in full in the result, possibly in
undesirable ways.
> d1
     kids states
1    Jack     CA
2    Jill     MA
3 Jillian     MA
4    John     HI
> d2a <- rbind(d2,list(15,"Jill"))
> d2a
  ages    kids
1   12    Jack
2   10    Jill
3    7 Lillian
4   15    Jill
> merge(d1,d2a)
  kids states ages
1 Jack     CA   12
2 Jill     MA   10
3 Jill     MA   15
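One way to guard against this is to look for repeated key values before merging. The sketch below uses anyDuplicated() and duplicated() on the data from this example; the check itself is a suggestion, not something prescribed by the text.

```r
d1 <- data.frame(
  kids   = c("Jack", "Jill", "Jillian", "John"),
  states = c("CA", "MA", "MA", "HI"),
  stringsAsFactors = FALSE
)
d2 <- data.frame(
  ages = c(12, 10, 7),
  kids = c("Jack", "Jill", "Lillian"),
  stringsAsFactors = FALSE
)
d2a <- rbind(d2, list(15, "Jill"))

# Index of the first repeated key, or 0 if all keys are unique
anyDuplicated(d2a$kids)            # 4

# Which key values are repeated
d2a$kids[duplicated(d2a$kids)]     # "Jill"

# The merge keeps one output row per matching pair
nrow(merge(d1, d2a))               # 3
```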
These are names for the employee ID, three performance ratings, and
the job title. DB had no header. The variables again began with the ID,
followed by start and end dates of employment.
Both files were in CSV format. Part of my data-cleaning phase consisted
of checking that each record contained the proper number of fields. DA, for
example, should have five fields per record. Here is the check:
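A check of this kind could be sketched with count.fields(), which returns the number of fields found on each line of a file. The temporary file, its column names, and its contents below are made up for illustration, and this is not necessarily the code used in the study.

```r
# Write a small made-up CSV; the last record is deliberately one field short
tmp <- tempfile(fileext = ".csv")
writeLines(c("EmpID,Perf1,Perf2,Perf3,JobTitle",
             "101,3,4,5,Engineer",
             "102,2,5,Analyst"), tmp)

# Number of comma-separated fields on each line
nfields <- count.fields(tmp, sep = ",")
nfields               # 5 5 4

# Do all records have five fields, and if not, which lines are off?
all(nfields == 5)     # FALSE
bad <- which(nfields != 5)
bad                   # 3

unlink(tmp)
```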