
Kids C ("Jack", "Jill") : 5.1 Creating Data Frames

1. The document discusses creating and accessing data frames in R. It provides an example of creating a simple two column data frame from vectors and explains how to access elements using column names or indices. 2. It then discusses an example data frame created from exam score data with variable names in the header. It demonstrates extracting sub data frames by row or column. 3. The document also discusses handling missing values (NAs) in data frames, including using the na.rm and subset functions to exclude rows with NAs from calculations or subsetting.

Uploaded by

Akanksha Mehra

5.1 Creating Data Frames


To begin, let's take another look at our simple data frame example from
Section 1.4.5:

> kids <- c("Jack","Jill")
> ages <- c(12,10)
> d <- data.frame(kids,ages,stringsAsFactors=FALSE)
> d  # matrix-like viewpoint
  kids ages
1 Jack   12
2 Jill   10

The first two arguments in the call to data.frame() are clear: We wish to
produce a data frame from our two vectors: kids and ages. However, that
third argument, stringsAsFactors=FALSE, requires more comment.
If the named argument stringsAsFactors is not specified, then by default,
stringsAsFactors will be TRUE in versions of R before 4.0.0. (In those
versions you can also use options() to arrange the opposite default; from
R 4.0.0 onward, the default is already FALSE.) With the TRUE default, if we
create a data frame from a character vector (in this case, kids), R will
convert that vector to a factor. Because our work with character data will
typically be with vectors rather than factors, we'll set stringsAsFactors to
FALSE. We'll cover factors in Chapter 6.
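The effect of the argument can be checked directly. The following is a minimal sketch that passes stringsAsFactors explicitly in both directions, so it does not depend on which default your R version uses; the names d_chr and d_fac are illustrative only.

```r
kids <- c("Jack", "Jill")
ages <- c(12, 10)

# Keep kids as a character vector
d_chr <- data.frame(kids, ages, stringsAsFactors = FALSE)

# Convert kids to a factor
d_fac <- data.frame(kids, ages, stringsAsFactors = TRUE)

class(d_chr$kids)
class(d_fac$kids)
```

Passing the argument explicitly, as the text does, keeps the result the same across R versions.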

5.1.1 Accessing Data Frames


Now that we have a data frame, let's explore a bit. Since d is a list, we can
access it as such via component index values or component names:

> d[[1]]
[1] "Jack" "Jill"
> d$kids
[1] "Jack" "Jill"

But we can treat it in a matrix-like fashion as well. For example, we can
view column 1:

> d[,1]
[1] "Jack" "Jill"

This matrix-like quality is also seen when we take d apart using str():

> str(d)
'data.frame':	2 obs. of  2 variables:
 $ kids: chr  "Jack" "Jill"
 $ ages: num  12 10

R tells us here that d consists of two observations (our two rows) that
store data on two variables (our two columns).

Consider three ways to access the first column of our data frame above:
d[[1]], d[,1], and d$kids. Of these, the third would generally be considered
to be clearer and, more importantly, safer than the first two. This better
identifies the column and makes it less likely that you will reference the
wrong column. But in writing general code (say, writing R packages),
matrix-like notation d[,1] is needed, and it is especially handy if you are
extracting subdata frames (as you'll see when we talk about extracting
subdata frames in Section 5.2).
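To illustrate why access by name is the safer choice, here is a small sketch. It assumes a reordered copy of d (the name d_swapped is hypothetical): positional access silently returns a different column, while d$kids still returns the names.

```r
d <- data.frame(kids = c("Jack", "Jill"), ages = c(12, 10),
                stringsAsFactors = FALSE)

# All three forms return the same vector for the original column order
same_list   <- identical(d[[1]], d$kids)
same_matrix <- identical(d[, 1], d$kids)

# Reorder the columns; position 1 is now ages, but the name still works
d_swapped <- d[, c("ages", "kids")]
first_by_position <- d_swapped[[1]]
first_by_name     <- d_swapped$kids
```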

5.1.2 Extended Example: Regression Analysis of Exam Grades Continued


Recall our course examination data set in Section 1.5. There, we didn't
have a header, but for this example we do, and the first few records in the
file now are as follows:

"Exam 1" "Exam 2" Quiz
2.0 3.3 4.0
3.3 2.0 3.7
4.0 4.0 4.0
2.3 0.0 3.3
2.3 1.0 3.3
3.3 3.7 4.0

As you can see, each line contains the three test scores for one student.
This is the classic two-dimensional file notion, like that alluded to in the
preceding output of str(). Here, each line in our file contains the data for
one observation in a statistical data set. The idea of a data frame is to
encapsulate such data, along with variable names, into one object.
Notice that we have separated the fields here by spaces. Other delimiters
may be specified, notably commas for comma-separated value (CSV) files (as
you'll see in Section 5.2.5). The variable names specified in the first record
must be separated by the same delimiter as used for the data, which is spaces
in this case. If the names themselves contain embedded spaces, as we have
here, they must be quoted.
We read in the file as before, but in this case we state that there is a
header record:

> examsquiz <- read.table("exams",header=TRUE)

The column names now appear, with periods replacing blanks:

> head(examsquiz)
  Exam.1 Exam.2 Quiz
1    2.0    3.3  4.0
2    3.3    2.0  3.7
3    4.0    4.0  4.0
4    2.3    0.0  3.3
5    2.3    1.0  3.3
6    3.3    3.7  4.0
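For readers who want to try this without the original file, here is a self-contained sketch. It writes three of the records shown earlier to a temporary file (the temporary path is a stand-in for the "exams" file, which is not reproduced here) and reads it back with a header.

```r
# A few records in the same space-delimited layout, with quoted header names
exam_lines <- c('"Exam 1" "Exam 2" Quiz',
                '2.0 3.3 4.0',
                '3.3 2.0 3.7',
                '4.0 4.0 4.0')

path <- tempfile()            # stand-in for the real "exams" file
writeLines(exam_lines, path)

examsquiz <- read.table(path, header = TRUE)
names(examsquiz)              # blanks in the header become periods
```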

5.2 Other Matrix-Like Operations


Various matrix operations also apply to data frames. Most notably and
usefully, we can do filtering to extract various subdata frames of interest.

5.2.1 Extracting Subdata Frames


As mentioned, a data frame can be viewed in row-and-column terms. In
particular, we can extract subdata frames by rows or columns. Here's an
example:

> examsquiz[2:5,]
  Exam.1 Exam.2 Quiz
2    3.3      2  3.7
3    4.0      4  4.0
4    2.3      0  3.3
5    2.3      1  3.3
> examsquiz[2:5,2]
[1] 2 4 0 1
> class(examsquiz[2:5,2])
[1] "numeric"
> examsquiz[2:5,2,drop=FALSE]
  Exam.2
2      2
3      4
4      0
5      1
> class(examsquiz[2:5,2,drop=FALSE])
[1] "data.frame"

Note that in that second call, since examsquiz[2:5,2] is a vector, R
created a vector instead of another data frame. By specifying drop=FALSE,
as described for the matrix case in Section 3.6, we can keep it as a
(one-column) data frame.
We can also do filtering. Here's how to extract the subframe of all
students whose first exam score was at least 3.8:

> examsquiz[examsquiz$Exam.1 >= 3.8,]
   Exam.1 Exam.2 Quiz
3       4    4.0  4.0
        4    3.3  4.0
11      4    4.0  4.0
14      4    0.0  4.0
16      4    3.7  4.0
19      4    4.0  4.0
22      4    4.0  4.0
25      4    4.0  3.3
29      4    3.0  3.7
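The filtering step can be broken into two parts: build a logical vector, then use it as the row index. The sketch below does this on the six records listed in Section 5.1.2 only (not the full data set), and also repeats the drop=FALSE comparison; the names keep, high, as_vector, and as_frame are illustrative.

```r
# First six records of the exam data, as shown earlier
examsquiz <- data.frame(Exam.1 = c(2.0, 3.3, 4.0, 2.3, 2.3, 3.3),
                        Exam.2 = c(3.3, 2.0, 4.0, 0.0, 1.0, 3.7),
                        Quiz   = c(4.0, 3.7, 4.0, 3.3, 3.3, 4.0))

# Step 1: one TRUE/FALSE per row
keep <- examsquiz$Exam.1 >= 3.8

# Step 2: rows where keep is TRUE, all columns
high <- examsquiz[keep, ]

# Single-column extraction: vector by default, data frame with drop=FALSE
as_vector <- examsquiz[2:5, 2]
as_frame  <- examsquiz[2:5, 2, drop = FALSE]
```

Note that the selected row keeps its original row name, which is why the row labels in the filtered output above are not consecutive.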

5.2.2 More on Treatment of NA Values


Suppose the second exam score for the first student had been missing. Then
we would have typed the following into that line when we were preparing
the data file:

2.0 NA 4.0

In any subsequent statistical analyses, R would do its best to cope with
the missing data. However, in some situations, we need to set the option
na.rm=TRUE, explicitly telling R to ignore NA values. For instance, with the
missing exam score, calculating the mean score on exam 2 by calling R's
mean() function with na.rm=TRUE would skip that first student in finding the
mean. Otherwise, R would just report NA for the mean.
Here's a little example:

> x <- c(2,NA,4)
> mean(x)
[1] NA
> mean(x,na.rm=TRUE)
[1] 3

In Section 2.8.2, you were introduced to the subset() function, which
excludes NA values automatically and so saves you the trouble of specifying
na.rm=TRUE. You can apply it in data frames for row selection. The column
names are taken in the context of the given data frame. In our example,
instead of typing this:

> examsquiz[examsquiz$Exam.1 >= 3.8,]

we could run this:

> subset(examsquiz,Exam.1 >= 3.8)

Note that we do not need to write this:

> subset(examsquiz,examsquiz$Exam.1 >= 3.8)
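The difference in NA handling between the two forms can be seen in a small sketch. The data frame scores below is invented for illustration: bracket filtering keeps a row of NAs wherever the condition is NA, while subset() drops it.

```r
# Invented scores; the second student's Exam.1 is missing
scores <- data.frame(Exam.1 = c(2.0, NA, 4.0),
                     Exam.2 = c(3.3, 2.0, 4.0))

# The condition is FALSE, NA, TRUE
by_bracket <- scores[scores$Exam.1 >= 3.8, ]   # NA condition yields an all-NA row
by_subset  <- subset(scores, Exam.1 >= 3.8)    # NA condition is treated as FALSE

nrow(by_bracket)
nrow(by_subset)
```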

In some cases, we may wish to rid our data frame of any observation
that has at least one NA value. A handy function for this purpose is
complete.cases().



> d4
     kids states
1    Jack     CA
2    <NA>     MA
3 Jillian     MA
4    John   <NA>
> complete.cases(d4)
[1]  TRUE FALSE  TRUE FALSE
> d5 <- d4[complete.cases(d4),]
> d5
     kids states
1    Jack     CA
3 Jillian     MA

Cases 2 and 4 were incomplete; hence the FALSE values in the output of
complete.cases(d4). We then use that output to select the intact rows.
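A self-contained version of this example might look like the following sketch. The construction of d4 is an assumption based on the printed output (the text does not show how d4 was built), and na.omit() is shown as a base R alternative that removes incomplete rows in one call.

```r
# Rebuild d4 from the printed output (construction is assumed)
d4 <- data.frame(kids   = c("Jack", NA, "Jillian", "John"),
                 states = c("CA", "MA", "MA", NA),
                 stringsAsFactors = FALSE)

ok <- complete.cases(d4)   # TRUE for rows with no NA in any column
d5 <- d4[ok, ]

# Alternative: na.omit() drops the same rows
d5_alt <- na.omit(d4)
```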

5.2.3 Using the rbind() and cbind() Functions and Alternatives


The rbind() and cbind() matrix functions introduced in Section 3.4 work with
data frames, too, providing that you have compatible sizes, of course. For
instance, you can use cbind() to add a new column that has the same length
as the existing columns.
In using rbind() to add a row, the added row is typically in the form of
another data frame or list.

> d
  kids ages
1 Jack   12
2 Jill   10
> rbind(d,list("Laura",19))
   kids ages
1  Jack   12
2  Jill   10
3 Laura   19
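The added row can also be supplied as a one-row data frame rather than a list, and cbind() works the same way for columns. A brief sketch follows; the grade column and its values are hypothetical and not part of the original example.

```r
d <- data.frame(kids = c("Jack", "Jill"), ages = c(12, 10),
                stringsAsFactors = FALSE)

# Row added as a one-row data frame with matching column names
new_row <- data.frame(kids = "Laura", ages = 19, stringsAsFactors = FALSE)
d_more <- rbind(d, new_row)

# Column added with cbind(); it must have one value per existing row
d_wide <- cbind(d, grade = c(7, 5))   # 'grade' is a made-up column
```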

You can also create new columns from old ones. For instance, we can
add a variable that is the difference between exams 1 and 2:

> eq <- cbind(examsquiz,examsquiz$Exam.2-examsquiz$Exam.1)
> class(eq)
[1] "data.frame"
> head(eq)
  Exam.1 Exam.2 Quiz examsquiz$Exam.2 - examsquiz$Exam.1
1    2.0    3.3  4.0                                 1.3
2    3.3    2.0  3.7                                -1.3
3    4.0    4.0  4.0                                 0.0
4    2.3    0.0  3.3                                -2.3
5    2.3    1.0  3.3                                -1.3
6    3.3    3.7  4.0                                 0.4

The new name is rather unwieldy: It's long, and it has embedded blanks.
We could change it, using the names() function, but it would be better to
exploit the list basis of data frames and add a column (of the same length)
to the data frame for this result:

> examsquiz$ExamDiff <- examsquiz$Exam.2 - examsquiz$Exam.1
> head(examsquiz)
  Exam.1 Exam.2 Quiz ExamDiff
1    2.0    3.3  4.0      1.3
2    3.3    2.0  3.7     -1.3
3    4.0    4.0  4.0      0.0
4    2.3    0.0  3.3     -2.3
5    2.3    1.0  3.3     -1.3
6    3.3    3.7  4.0      0.4
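For comparison, the two routes (renaming the cbind() result with names(), or adding the component directly) can be sketched side by side. This uses only the first three records of the exam data, so it is an illustration rather than a reproduction of the full output.

```r
examsquiz <- data.frame(Exam.1 = c(2.0, 3.3, 4.0),
                        Exam.2 = c(3.3, 2.0, 4.0),
                        Quiz   = c(4.0, 3.7, 4.0))

# Route 1: cbind(), then replace the unwieldy fourth name
eq <- cbind(examsquiz, examsquiz$Exam.2 - examsquiz$Exam.1)
names(eq)[4] <- "ExamDiff"

# Route 2: add the column as a new list component
examsquiz$ExamDiff <- examsquiz$Exam.2 - examsquiz$Exam.1
```

Both routes end with the same column names; the second avoids the intermediate object and the renaming step.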

What happened here? Since one can add a new component to an already
existing list at any time, we did so: We added a component ExamDiff to the
list/data frame examsquiz.
We can even exploit recycling to add a column that is of a different
length than those in the data frame:

> d
  kids ages
1 Jack   12
2 Jill   10
> d$one <- 1
> d
  kids ages one
1 Jack   12   1
2 Jill   10   1
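Recycling has limits, which the following sketch probes: a single value is repeated to fill the column, but a vector longer than the number of rows is rejected. The column name three and the tryCatch() wrapper are illustrative additions, not part of the original example.

```r
d <- data.frame(kids = c("Jack", "Jill"), ages = c(12, 10),
                stringsAsFactors = FALSE)

# A length-1 value is recycled to both rows
d$one <- 1

# A length-3 value cannot fit a 2-row data frame; capture the error
outcome <- tryCatch({
  d$three <- 1:3
  "no error"
}, error = function(e) "error")
```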

5.2.4 Applying apply()


You can use apply() on data frames, if the columns are all of the same type.
For instance, we can find the maximum grade for each student, as follows:

> apply(examsquiz,1,max)
 [1] 4.0 3.7 4.0 3.3 3.3 4.0 3.7 3.3 4.0 4.0 4.0 3.) 4.0 4.0 3.7 4.0 3.3 3.7 4.0
[20] 3.7 4.0 4.0 3.3 3.3 4.0 4.0 3.3 3.3 4.0 3.7 3.3 3.3 3.7 2.7 3.3 4.0 3.7 3.7
[39] 3.7
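A reduced version of this call, using only the first three records of the exam data, shows both directions: 1 applies the function to each row and 2 to each column. The variable names are illustrative.

```r
examsquiz <- data.frame(Exam.1 = c(2.0, 3.3, 4.0),
                        Exam.2 = c(3.3, 2.0, 4.0),
                        Quiz   = c(4.0, 3.7, 4.0))

# Second argument 1: one result per row (per student)
row_max <- apply(examsquiz, 1, max)

# Second argument 2: one result per column (per test)
col_mean <- apply(examsquiz, 2, mean)
```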

5.2.5 Extended Example: A Salary Study


In a study of engineers and programmers, I considered the question, "How
many of these workers are the best and the brightest (that is, people of
extraordinary ability)?" (Some of the details have been changed here.)
The government data I had available was limited. One (admittedly
imperfect) way to determine whether a worker is of extraordinary ability is
to look at the ratio of actual salary to the government prevailing wage for
that job and location. If that ratio is substantially higher than 1.0, you can
reasonably assume that this worker has a high level of talent.
I used R to prepare and analyze the data and will present excerpts of my
preparation code here. First, I read in the data file:

all2006 <- read.csv("2006.csv",header=TRUE,as.is=TRUE)

The function read.csv() is essentially identical to read.table() except
that the input data is in the CSV format exported by spreadsheets, which is
the way the data set was prepared by the US Department of Labor (DOL).
The as.is argument is the negation of stringsAsFactors, which you saw
earlier in Section 5.1. So, setting as.is to TRUE here is simply an alternate
way to achieve stringsAsFactors=FALSE.
At this point, I had a data frame, all2006, consisting of all the data for
the year 2006. I then did some filtering:

all2006 <- all2006[all2006$Wage_Per=="Year",] # exclude hourly-wagers
all2006 <- all2006[all2006$Wage_Offered_From > 20000,] # exclude weird cases
all2006 <- all2006[all2006$Prevailing_Wage_Amount > 200,] # exclude hrly prv wg

These operations are typical data cleaning. Most large data sets contain
some outlandish values: some are obvious errors, others use different
measurement systems, and so on. I needed to remedy this situation before
doing any analysis.
I also needed to create a new column for the ratio between actual wage
and prevailing wage:

all2006$rat <- all2006$Wage_Offered_From / all2006$Prevailing_Wage_Amount

Since I knew I would be calculating the median in this new column for
many subsets of the data, I defined a function to do the work:

medrat <- function(dataframe) {
   return(median(dataframe$rat,na.rm=TRUE))
}

Note the need to exclude NA values, which are common in government
data sets.
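Since the 2006 data file is not reproduced here, the cleaning steps can be tried on a toy data frame. In the sketch below the column names are the ones used in the text, but every value is invented purely for illustration.

```r
# Invented records; only the column names come from the text
toy <- data.frame(Wage_Per               = c("Year", "Hour", "Year", "Year"),
                  Wage_Offered_From      = c(90000, 45, 60000, 15000),
                  Prevailing_Wage_Amount = c(75000, 40, 60000, 30000),
                  stringsAsFactors = FALSE)

toy <- toy[toy$Wage_Per == "Year", ]             # exclude hourly-wagers
toy <- toy[toy$Wage_Offered_From > 20000, ]      # exclude weird cases
toy <- toy[toy$Prevailing_Wage_Amount > 200, ]   # exclude hourly prevailing wage

# Ratio of actual wage to prevailing wage
toy$rat <- toy$Wage_Offered_From / toy$Prevailing_Wage_Amount

medrat <- function(dataframe) {
  return(median(dataframe$rat, na.rm = TRUE))
}
toy_median <- medrat(toy)
```

Two rows survive the three filters here, with ratios 1.2 and 1.0.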

I was particularly interested in three occupations and thus extracted
subdata frames for them to make their analyses more convenient:

se2006 <- all2006[grep("Software Engineer",all2006),]
prg2006 <- all2006[grep("Programmer",all2006),]
ee2006 <- all2006[grep("Electronics Engineer",all2006),]

Here, I used R's grep() function to identify the rows containing the given
job title. Details on this function are in Chapter 11.
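As a hedged sketch of row selection with grep(): the function returns the indices of the matching elements of whatever vector it is given, so applying it to a single character column makes those indices row numbers. The column name Job_Title and all values below are hypothetical; the text does not say which column of all2006 holds the job title.

```r
# Invented data; Job_Title is a hypothetical column name
toy <- data.frame(Job_Title = c("Software Engineer II",
                                "Programmer Analyst",
                                "Senior Software Engineer",
                                "Electronics Engineer"),
                  rat = c(1.2, 1.0, 1.4, 1.1),
                  stringsAsFactors = FALSE)

# Indices of the rows whose title contains the phrase
se_rows <- grep("Software Engineer", toy$Job_Title)
se_toy  <- toy[se_rows, ]
```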
Another aspect of interest was analysis by firm. I wrote this function to
extract the subdata frame for a given firm:

makecorp <- function(corpname) {
   t <- all2006[all2006$Employer_Name == corpname,]
   return(t)
}

I then created subdata frames for a number of firms (only some are
shown here).

corplist <- c("MICROSOFT CORPORATION","ms","INTEL CORPORATION","intel",
   "SUN MICROSYSTEMS, INC.","sun","GOOGLE INC.","google")

for (i in 1:(length(corplist)/2)) {
   corp <- corplist[2*i-1]
   newdtf <- paste(corplist[2*i],"2006",sep="")
   assign(newdtf,makecorp(corp),pos=.GlobalEnv)
}

There's quite a bit to discuss in the above code. First, note that I want
the variables I'm creating to be at the top (that is, global) level, which is
the usual place one does interactive analysis. Also, I'm creating my new
variable names from character strings, such as "intel2006". For these
reasons, the assign() function is wonderful. It allows me to assign a
variable by its name as a string and enables me to specify top level (as
discussed in Section 7.8.2).
The paste() function allows me to concatenate strings, with sep=""
specifying that I don't want any characters between strings in my
concatenation.
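The loop can be exercised end to end on a toy stand-in for all2006. In this sketch the employer names come from the text, but the rows and rat values are invented; get() is used afterward to look up the variables that assign() created by name.

```r
# Toy stand-in for all2006; values are invented
all2006 <- data.frame(Employer_Name = c("INTEL CORPORATION",
                                        "GOOGLE INC.",
                                        "INTEL CORPORATION"),
                      rat = c(1.1, 1.3, 0.9),
                      stringsAsFactors = FALSE)

makecorp <- function(corpname) {
  all2006[all2006$Employer_Name == corpname, ]
}

# Pairs of (full employer name, short tag)
corplist <- c("INTEL CORPORATION", "intel", "GOOGLE INC.", "google")

for (i in 1:(length(corplist) / 2)) {
  corp   <- corplist[2 * i - 1]                        # full name
  newdtf <- paste(corplist[2 * i], "2006", sep = "")   # e.g. "intel2006"
  assign(newdtf, makecorp(corp), pos = .GlobalEnv)     # create global variable
}

n_intel  <- nrow(get("intel2006", envir = .GlobalEnv))
n_google <- nrow(get("google2006", envir = .GlobalEnv))
```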

5.3 Merging Data Frames


In the relational database world, one of the most important operations is
that of a join, in which two tables can be combined according to the values
of a common variable. In R, two data frames can be similarly combined
using the merge() function.
The simplest form is as follows:

merge(x, y)



This merges data frames x and y. It assumes that the two data frames
have one or more columns with names in common. Here's an example:

> d1
     kids states
1    Jack     CA
2    Jill     MA
3 Jillian     MA
4    John     HI
> d2
  ages    kids
1   10    Jill
2    7 Lillian
3   12    Jack
> d <- merge(d1,d2)
> d
  kids states ages
1 Jack     CA   12
2 Jill     MA   10
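By default merge() keeps only the rows whose key appears in both data frames. As a sketch of the alternative, the all=TRUE argument of merge() keeps unmatched rows from both sides and fills the missing values with NA. The construction of d1 and d2 below is assumed from the printed output.

```r
d1 <- data.frame(kids   = c("Jack", "Jill", "Jillian", "John"),
                 states = c("CA", "MA", "MA", "HI"),
                 stringsAsFactors = FALSE)
d2 <- data.frame(ages = c(10, 7, 12),
                 kids = c("Jill", "Lillian", "Jack"),
                 stringsAsFactors = FALSE)

# Default: only kids present in both data frames
inner <- merge(d1, d2)

# all=TRUE: every kid from either data frame, NA where data is missing
full <- merge(d1, d2, all = TRUE)
```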

Here, the two data frames have the variable kids in common. R found
the rows in which this variable had the same value of kids in both data frames
(the ones for Jack and Jill). It then created a data frame with corresponding
rows and with columns taken from both data frames (kids, states, and ages).
The merge() function has named arguments by.x and by.y, which handle
cases in which variables have similar information but different names in the
two data frames. Here's an example:

> d3
  ages    pals
1   12    Jack
2   10    Jill
3    7 Lillian
> merge(d1,d3,by.x="kids",by.y="pals")
  kids states ages
1 Jack     CA   12
2 Jill     MA   10

Even though our variable was called kids in one data frame and pals in
the other, it was meant to store the same information, and thus the merge
made sense.
Duplicate matches will appear in full in the result, possibly in undesirable
ways.
> d1
     kids states
1    Jack     CA
2    Jill     MA
3 Jillian     MA
4    John     HI
> d2a <- rbind(d2,list(15,"Jill"))
> d2a
  ages    kids
1   12    Jack
2   10    Jill
3    7 Lillian
4   15    Jill
> merge(d1,d2a)
  kids states ages
1 Jack     CA   12
2 Jill     MA   10
3 Jill     MA   15

There are two Jills in d2a. There is a Jill in d1 who lives in Massachusetts
and another Jill with unknown residence. In our previous example,
merge(d1,d2), there was only one Jill, who was presumed to be the same
person in both data frames. But here, in the call merge(d1,d2a), it may have
been the case that only one of the Jills was a Massachusetts resident. It is
clear from this little example that you must choose matching variables with
great care.
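One way to catch this situation before merging is to check the key column for repeated values. The sketch below uses base R's duplicated() for that check; it is an illustrative safeguard, not something the original analysis is stated to have done, and the data frames are rebuilt from the printed output.

```r
d1 <- data.frame(kids   = c("Jack", "Jill", "Jillian", "John"),
                 states = c("CA", "MA", "MA", "HI"),
                 stringsAsFactors = FALSE)
d2a <- data.frame(ages = c(12, 10, 7, 15),
                  kids = c("Jack", "Jill", "Lillian", "Jill"),
                  stringsAsFactors = FALSE)

# Keys that occur more than once in the data frame to be merged
dup_keys <- unique(d2a$kids[duplicated(d2a$kids)])

# The merge repeats the d1 row for each duplicate match
m <- merge(d1, d2a)
```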

5.3.1 Extended Example: An Employee Database


The following is an adaptation of one of my consulting projects. At issue
was whether older workers were faring as well as younger ones. I had data
on several variables, such as age and performance ratings, which I used in
my comparison of the older and younger employees. I also had employee
ID numbers, which were crucial in being able to connect the two data files:
DA and DB.
The DA file had this header:

"EmpID","Perf 1","Perf 2","Perf 3","Job Title"


These are names for the employee ID, three performance ratings, and
the job title. DB had no header. The variables again began with the ID,
followed by start and end dates of employment.
Both files were in CSV format. Part of my data-cleaning phase consisted
of checking that each record contained the proper number of fields. DA, for
example, should have five fields per record. Here is the check:

count.fields ("DA", sep=", ")


(1S5 5 5 555555 5555 5 555555 5 55555 555555
5555
