Kids C ("Jack", "Jill") : 5.1 Creating Data Frames

1. The document discusses creating and accessing data frames in R. It provides an example of creating a simple two column data frame from vectors and explains how to access elements using column names or indices. 2. It then discusses an example data frame created from exam score data with variable names in the header. It demonstrates extracting sub data frames by row or column. 3. The document also discusses handling missing values (NAs) in data frames, including using the na.rm and subset functions to exclude rows with NAs from calculations or subsetting.

Uploaded by

Akanksha Mehra

5.1 Creating Data Frames

To begin, let's take another look at our simple data frame example from
Section 1.4.5:

> kids <- c("Jack","Jill")
> ages <- c(12,10)
> d <- data.frame(kids,ages,stringsAsFactors=FALSE)
> d  # matrix-like viewpoint
  kids ages
1 Jack   12
2 Jill   10

The first two arguments in the call to data.frame() are clear: We wish to
produce a data frame from our two vectors: kids and ages. However, that
third argument, stringsAsFactors=FALSE, requires more comment.
If the named argument stringsAsFactors is not specified, then by default,
stringsAsFactors will be TRUE in versions of R before 4.0.0 (from 4.0.0 on,
the default is FALSE). (In those earlier versions, you can also use options()
to arrange the opposite default.) With a default of TRUE, if we create a data
frame from a character vector (in this case, kids), R will convert that vector
to a factor. Because our work with character data will typically be with
vectors rather than factors, we'll set stringsAsFactors to FALSE. We'll cover
factors in Chapter 6.
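
As a quick check of the effect described above, the following sketch builds the same data frame both ways and inspects the class of the kids column. It sets stringsAsFactors explicitly in both calls, so the result should not depend on the default of the R version in use. The names d_chr and d_fac are made up for this illustration.

```r
kids <- c("Jack", "Jill")
ages <- c(12, 10)

# Keep the character vector as it is
d_chr <- data.frame(kids, ages, stringsAsFactors = FALSE)
# Ask R to convert character vectors to factors
d_fac <- data.frame(kids, ages, stringsAsFactors = TRUE)

class(d_chr$kids)    # "character"
class(d_fac$kids)    # "factor"
levels(d_fac$kids)   # "Jack" "Jill"
```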

5.1.1 Accessing Data Frames

Now that we have a data frame, let's explore a bit. Since d is a list, we can
access it as such via component index values or component names:

> d[[1]]
[1] "Jack" "Jill"
> d$kids
[1] "Jack" "Jill"

But we can treat it in a matrix-like fashion as well. For example, we can
view column 1:

> d[,1]
[1] "Jack" "Jill"

This matrix-like quality is also seen when we take d apart using str():

> str(d)
'data.frame':   2 obs. of  2 variables:
 $ kids: chr  "Jack" "Jill"
 $ ages: num  12 10

R tells us here that d consists of two observations (our two rows) that
store data on two variables (our two columns).

Consider three ways to access the first column of our data frame above:
d[[1]], d[,1], and d$kids. Of these, the third would generally be considered
to be clearer and, more importantly, safer than the first two. This better
identifies the column and makes it less likely that you will reference the
wrong column. But in writing general code (say, writing R packages),
matrix-like notation d[,1] is needed, and it is especially handy if you are
extracting subdata frames (as you'll see when we talk about extracting subdata
frames in Section 5.2).
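
The three access forms can be compared directly. The sketch below is only an illustration: it rebuilds the small kids/ages data frame and then uses a column-swapped copy (d_swapped, a name invented here) to show why access by name is less fragile than access by position.

```r
d <- data.frame(kids = c("Jack", "Jill"), ages = c(12, 10),
                stringsAsFactors = FALSE)

by_index  <- d[[1]]    # list-style access by position
by_matrix <- d[, 1]    # matrix-style access: all rows, column 1
by_name   <- d$kids    # list-style access by name

identical(by_index, by_matrix)   # TRUE
identical(by_index, by_name)     # TRUE

# Position-based access follows column order; name-based access does not
d_swapped <- d[, c("ages", "kids")]   # same data, columns swapped
d_swapped[[1]]                        # now the ages: 12 10
d_swapped$kids                        # still the names: "Jack" "Jill"
```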

5.1.2 Extended Example: Regression Analysis of Exam Grades Continued

Recall our course examination data set in Section 1.5. There, we didn't
have a header, but for this example we do, and the first few records in the
file now are as follows:

"Exam 1" "Exam 2" Quiz
2.0 3.3 4.0
3.3 2.0 3.7
4.0 4.0 4.0
2.3 0.0 3.3
2.3 1.0 3.3
3.3 3.7 4.0

As you can see, each line contains the three test scores for one student.
This is the classic two-dimensional file notion, like that alluded to in the
preceding output of str(). Here, each line in our file contains the data for
one observation in a statistical data set. The idea of a data frame is to
encapsulate such data, along with variable names, into one object.
Notice that we have separated the fields here by spaces. Other delimiters
may be specified, notably commas for comma-separated value (CSV) files (as
you'll see in Section 5.2.5). The variable names specified in the first record
must be separated by the same delimiter as used for the data, which is spaces
in this case. If the names themselves contain embedded spaces, as we have
here, they must be quoted.
We read in the file as before, but in this case we state that there is a
header record:

> examsquiz <- read.table("exams",header=TRUE)

The column names now appear, with periods replacing blanks:

> head(examsquiz)
  Exam.1 Exam.2 Quiz
1    2.0    3.3  4.0
2    3.3    2.0  3.7
3    4.0    4.0  4.0
4    2.3    0.0  3.3
5    2.3    1.0  3.3
6    3.3    3.7  4.0
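
To try the header handling without a file on disk, the sketch below feeds a few of the records to read.table() through its text argument. The variable names exam_text, examsquiz_demo, and raw_names are invented for the illustration, and the inline string merely stands in for the "exams" file.

```r
# Inline stand-in for the "exams" file: header names with blanks are quoted
exam_text <- '"Exam 1" "Exam 2" Quiz
2.0 3.3 4.0
3.3 2.0 3.7
4.0 4.0 4.0'

examsquiz_demo <- read.table(text = exam_text, header = TRUE)
names(examsquiz_demo)   # "Exam.1" "Exam.2" "Quiz"

# check.names = FALSE keeps the original names, blanks included
raw_names <- read.table(text = exam_text, header = TRUE, check.names = FALSE)
names(raw_names)        # "Exam 1" "Exam 2" "Quiz"
```

The periods come from read.table()'s default check.names=TRUE, which turns each header entry into a syntactically valid R name.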

5.2 Other Matrix-Like Operations

Various matrix operations also apply to data frames. Most notably and
usefully, we can do filtering to extract various subdata frames of interest.

5.2.1 Extracting Subdata Frames

As mentioned, a data frame can be viewed in row-and-column terms. In
particular, we can extract subdata frames by rows or columns. Here's an
example:

> examsquiz[2:5,]
  Exam.1 Exam.2 Quiz
2    3.3      2  3.7
3    4.0      4  4.0
4    2.3      0  3.3
5    2.3      1  3.3
> examsquiz[2:5,2]
[1] 2 4 0 1
> class(examsquiz[2:5,2])
[1] "numeric"
> examsquiz[2:5,2,drop=FALSE]
  Exam.2
2      2
3      4
4      0
5      1
> class(examsquiz[2:5,2,drop=FALSE])
[1] "data.frame"

Note that in that second call, since examsquiz[2:5,2] is a vector, R
created a vector instead of another data frame. By specifying drop=FALSE,
as described for the matrix case in Section 3.6, we can keep it as a
(one-column) data frame.
We can also do filtering. Here's how to extract the subframe of all
students whose first exam score was at least 3.8:

> examsquiz[examsquiz$Exam.1 >= 3.8,]
   Exam.1 Exam.2 Quiz
3       4    4.0  4.0
9       4    3.3  4.0
11      4    4.0  4.0
14      4    0.0  4.0
16      4    3.7  4.0
19      4    4.0  4.0
22      4    4.0  4.0
25      4    4.0  3.3
29      4    3.0  3.7

5.2.2 More on Treatment of NA Values

Suppose the second exam score for the first student had been missing. Then
we would have typed the following into that line when we were preparing
the data file:

2.0 NA 4.0

In any subsequent statistical analyses, R would do its best to cope with
the missing data. However, in some situations, we need to set the option
na.rm=TRUE, explicitly telling R to ignore NA values. For instance, with the
missing exam score, calculating the mean score on exam 2 by calling R's
mean() function with na.rm=TRUE would skip that first student in finding the
mean. Otherwise, R would just report NA for the mean.
Here's a little example:

> x <- c(2,NA,4)
> mean(x)
[1] NA
> mean(x,na.rm=TRUE)
[1] 3

In Section 2.8.2, you were introduced to the subset() function, which
saves you the trouble of specifying na.rm=TRUE. You can apply it in data frames
for row selection. The column names are taken in the context of the given
data frame. In our example, instead of typing this:

> examsquiz[examsquiz$Exam.1 >= 3.8,]

we could run this:

> subset(examsquiz,Exam.1 >= 3.8)

Note that we do not need to write this:

> subset(examsquiz,examsquiz$Exam.1 >= 3.8)
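
One concrete way to see how the two forms differ when a score is missing is sketched below. The three-row data frame scores is invented for the illustration; it is not the exam data set.

```r
scores <- data.frame(Exam.1 = c(2.0, NA, 4.0),
                     Quiz   = c(4.0, 3.7, 4.0))

# Bracket filtering: the NA comparison yields NA, which produces an all-NA row
by_bracket <- scores[scores$Exam.1 >= 3.8, ]
nrow(by_bracket)   # 2 (one all-NA row plus the real match)

# subset(): rows for which the condition is NA are dropped
by_subset <- subset(scores, Exam.1 >= 3.8)
nrow(by_subset)    # 1
```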

In some cases, we may wish to rid our data frame of any observation
that has at least one NA value. A handy function for this purpose is
complete.cases().

> d4
     kids states
1    Jack     CA
2    <NA>     MA
3 Jillian     MA
4    John   <NA>
> complete.cases(d4)
[1]  TRUE FALSE  TRUE FALSE
> d5 <- d4[complete.cases(d4),]
> d5
     kids states
1    Jack     CA
3 Jillian     MA

Cases 2 and 4 were incomplete; hence the FALSE values in the output of
complete.cases(d4). We then use that output to select the intact rows.
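
For readers who want to reproduce this, here is a minimal sketch that builds a data frame like d4 by hand (the construction is an assumption, since the text shows only the printed result) and also shows na.omit(), a base R function that removes incomplete rows in one step.

```r
d4 <- data.frame(kids   = c("Jack", NA, "Jillian", "John"),
                 states = c("CA", "MA", "MA", NA),
                 stringsAsFactors = FALSE)

keep <- complete.cases(d4)   # TRUE FALSE TRUE FALSE
d5 <- d4[keep, ]             # rows 1 and 3 only

# na.omit() selects the same rows in one step
d5_alt <- na.omit(d4)
rownames(d5_alt)             # "1" "3"
```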

5.2.3 Using the rbind() and cbind() Functions and Alternatives

The rbind() and cbind() matrix functions introduced in Section 3.4 work with
data frames, too, providing that you have compatible sizes, of course. For
instance, you can use cbind() to add a new column that has the same length
as the existing columns.
In using rbind() to add a row, the added row is typically in the form of
another data frame or list.

> d
  kids ages
1 Jack   12
2 Jill   10
> rbind(d,list("Laura",19))
   kids ages
1  Jack   12
2  Jill   10
3 Laura   19

You can also create new columns from old ones. For instance, we can
add a variable that is the difference between exams 1 and 2:

> eq <- cbind(examsquiz,examsquiz$Exam.2-examsquiz$Exam.1)
> class(eq)
[1] "data.frame"
> head(eq)
  Exam.1 Exam.2 Quiz examsquiz$Exam.2 - examsquiz$Exam.1
1    2.0    3.3  4.0                                 1.3
2    3.3    2.0  3.7                                -1.3
3    4.0    4.0  4.0                                 0.0
4    2.3    0.0  3.3                                -2.3
5    2.3    1.0  3.3                                -1.3
6    3.3    3.7  4.0                                 0.4

The new name is rather unwieldy: It's long, and it has embedded blanks.
We could change it, using the names() function, but it would be better to
exploit the list basis of data frames and add a column (of the same length)
to the data frame for this result:

> examsquiz$ExamDiff <- examsquiz$Exam.2 - examsquiz$Exam.1
> head(examsquiz)
  Exam.1 Exam.2 Quiz ExamDiff
1    2.0    3.3  4.0      1.3
2    3.3    2.0  3.7     -1.3
3    4.0    4.0  4.0      0.0
4    2.3    0.0  3.3     -2.3
5    2.3    1.0  3.3     -1.3
6    3.3    3.7  4.0      0.4

What happened here? Since one can add a new component to an already
existing list at any time, we did so: We added a component ExamDiff to the
list/data frame examsquiz.
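
The two routes to a difference column can be put side by side. This sketch uses a two-row stand-in data frame (exams, invented here) and shows the names() repair for the cbind() version next to the direct, list-style assignment.

```r
exams <- data.frame(Exam.1 = c(2.0, 3.3), Exam.2 = c(3.3, 2.0))

# Option 1: cbind(), then repair the unwieldy column name with names()
eq <- cbind(exams, exams$Exam.2 - exams$Exam.1)
names(eq)[3] <- "ExamDiff"

# Option 2: add the column directly, list-style
exams$ExamDiff <- exams$Exam.2 - exams$Exam.1

identical(names(eq), names(exams))   # TRUE
```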
We can even exploit recycling to add a column that is of a different
length than those in the data frame:

> d
  kids ages
1 Jack   12
2 Jill   10
> d$one <- 1
> d
  kids ages one
1 Jack   12   1
2 Jill   10   1

5.2.4 Applying apply()

You can use apply() on data frames, if the columns are all of the same type.
For instance, we can find the maximum grade for each student, as follows:

> apply(examsquiz,1,max)
 [1] 4.0 3.7 4.0 3.3 3.3 4.0 3.7 3.3 4.0 4.0 4.0 3.3 4.0 4.0 3.7 4.0 3.3 3.7 4.0
[20] 3.7 4.0 4.0 3.3 3.3 4.0 4.0 3.3 3.3 4.0 3.7 3.3 3.3 3.7 2.7 3.3 4.0 3.7 3.7
[39] 3.7
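
The requirement that all columns share a type can be seen in a small sketch. The data frames grades and mixed below are invented; the point is that a data frame is converted to a matrix before the function is applied, so a single character column turns every value into a string.

```r
grades <- data.frame(Exam.1 = c(2.0, 3.3, 4.0),
                     Exam.2 = c(3.3, 2.0, 4.0),
                     Quiz   = c(4.0, 3.7, 4.0))

best <- apply(grades, 1, max)   # row-wise maximum
best                            # 4.0 3.7 4.0

# With a character column added, as.matrix() coerces everything to character
mixed <- cbind(grades, name = c("a", "b", "c"))
is.character(as.matrix(mixed))  # TRUE
```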

5.2.5 Extended Example: A Salary Study

In a study of engineers and programmers, I considered the question, "How
many of these workers are the best and the brightest, that is, people of
extraordinary ability?" (Some of the details have been changed here.)
The government data I had available was limited. One (admittedly
imperfect) way to determine whether a worker is of extraordinary ability is
to look at the ratio of actual salary to the government prevailing wage for
that job and location. If that ratio is substantially higher than 1.0, you can
reasonably assume that this worker has a high level of talent.
I used R to prepare and analyze the data and will present excerpts of my
preparation code here. First, I read in the data file:

all2006 <- read.csv("2006.csv",header=TRUE,as.is=TRUE)

The function read.csv() is essentially identical to read.table() except
that the input data is in the CSV format exported by spreadsheets, which is
the way the data set was prepared by the US Department of Labor (DOL).
The as.is argument is the negation of stringsAsFactors, which you saw
earlier in Section 5.1. So, setting as.is to TRUE here is simply an alternate
way to achieve stringsAsFactors=FALSE.
At this point, I had a data frame, all2006, consisting of all the data for
the year 2006. I then did some filtering:

all2006 <- all2006[all2006$Wage_Per=="Year",] # exclude hourly-wagers
all2006 <- all2006[all2006$Wage_Offered_From > 20000,] # exclude weird cases
all2006 <- all2006[all2006$Prevailing_Wage_Amount > 200,] # exclude hrly prv wg

These operations are typical data cleaning. Most large data sets contain
some outlandish values: some are obvious errors, others use different
measurement systems, and so on. I needed to remedy this situation before doing
any analysis.
I also needed to create a new column for the ratio between actual wage
and prevailing wage:

all2006$rat <- all2006$Wage_Offered_From / all2006$Prevailing_Wage_Amount

Since I knew I would be calculating the median in this new column for
many subsets of the data, I defined a function to do the work:

medrat <- function(dataframe) {
   return(median(dataframe$rat,na.rm=TRUE))
}

Note the need to exclude NA values, which are common in government
data sets.
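
The ratio column and the median helper can be exercised on a toy data frame. The data frame wages and its values below are invented for illustration; only the column names and the medrat() definition follow the text.

```r
# Toy stand-in for the DOL data; the values are invented for illustration
wages <- data.frame(
  Wage_Offered_From      = c(60000, 90000, NA, 120000),
  Prevailing_Wage_Amount = c(50000, 60000, 70000, 80000)
)
wages$rat <- wages$Wage_Offered_From / wages$Prevailing_Wage_Amount

medrat <- function(dataframe) {
  median(dataframe$rat, na.rm = TRUE)
}

medrat(wages)       # median of 1.2, 1.5, 1.5, which is 1.5
median(wages$rat)   # NA, because one ratio is missing
```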

I was particularly interested in three occupations and thus extracted
subdata frames for them to make their analyses more convenient:

se2006 <- all2006[grep("Software Engineer",all2006),]
prg2006 <- all2006[grep("Programmer",all2006),]
ee2006 <- all2006[grep("Electronics Engineer",all2006),]

Here, I used R's grep() function to identify the rows containing the given
job title. Details on this function are in Chapter 11.
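
The sketch below shows the same kind of extraction on a toy data frame, searching a single column. The column name Job_Title and all values are hypothetical, since the excerpt does not name the column that holds the job title. Searching one column is worth considering because grep() applied to an entire data frame matches against each column converted to a single string, so the indices it returns refer to columns rather than rows.

```r
# Hypothetical column name and values, for illustration only
jobs <- data.frame(
  Job_Title = c("Software Engineer II", "Programmer Analyst",
                "Senior Software Engineer", "Electronics Engineer"),
  rat = c(1.3, 1.0, 1.6, 1.1),
  stringsAsFactors = FALSE
)

# grep() returns the indices of the elements that match the pattern
se_rows <- grep("Software Engineer", jobs$Job_Title)   # 1 3
se <- jobs[se_rows, ]                                  # the matching rows
```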
Another aspect of interest was analysis by firm. I wrote this function to
extract the subdata frame for a given firm:

makecorp <- function(corpname) {
   t <- all2006[all2006$Employer_Name == corpname,]
   return(t)
}

I then created subdata frames for a number of firms (only some are
shown here).

corplist <- c("MICROSOFT CORPORATION","ms","INTEL CORPORATION","intel",
   "SUN MICROSYSTEMS, INC.","sun","GOOGLE INC.","google")

for (i in 1:(length(corplist)/2)) {
   corp <- corplist[2*i-1]
   newdtf <- paste(corplist[2*i],"2006",sep="")
   assign(newdtf,makecorp(corp),pos=.GlobalEnv)
}

There's quite a bit to discuss in the above code. First, note that I want
the variables I'm creating to be at the top (that is, global) level, which is the
usual place one does interactive analysis. Also, I'm creating my new variable
names from character strings, such as "intel2006". For these reasons, the
assign() function is wonderful. It allows me to assign a variable by its name as
a string and enables me to specify top level (as discussed in Section 7.8.2).
The paste() function allows me to concatenate strings, with sep="" specifying
that I don't want any characters between strings in my concatenation.
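
The same pattern can be tried on a small scale. In the sketch below the data frame firms and its values are invented, and the variables are assigned into a scratch environment (demo_env, created with new.env()) rather than the global environment, so running it leaves the workspace untouched.

```r
# Invented toy data standing in for all2006
firms <- data.frame(
  Employer_Name = c("INTEL CORPORATION", "GOOGLE INC.", "INTEL CORPORATION"),
  rat = c(1.2, 1.4, 1.1),
  stringsAsFactors = FALSE
)

corplist <- c("INTEL CORPORATION", "intel", "GOOGLE INC.", "google")
demo_env <- new.env()   # a scratch environment instead of .GlobalEnv

for (i in seq_len(length(corplist) / 2)) {
  corp   <- corplist[2 * i - 1]                        # full firm name
  newdtf <- paste(corplist[2 * i], "2006", sep = "")   # e.g. "intel2006"
  assign(newdtf, firms[firms$Employer_Name == corp, ], envir = demo_env)
}

ls(demo_env)                               # "google2006" "intel2006"
nrow(get("intel2006", envir = demo_env))   # 2
```

A named list of data frames would be a common alternative to assign() when the variables do not need to live at top level.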

5.3 Merging Data Frames

In the relational database world, one of the most important operations is
that of a join, in which two tables can be combined according to the values
of a common variable. In R, two data frames can be similarly combined
using the merge() function.
The simplest form is as follows:

merge(x, y)

This merges data frames x and y. It assumes that the two data frames
have one or more columns with names in common. Here's an example:

> d1
     kids states
1    Jack     CA
2    Jill     MA
3 Jillian     MA
4    John     HI
> d2
  ages    kids
1   10    Jill
2    7 Lillian
3   12    Jack
> d <- merge(d1,d2)
> d
  kids states ages
1 Jack     CA   12
2 Jill     MA   10

Here, the two data frames have the variable kids in common. R found
the rows in which this variable had the same value of kids in both data frames
(the ones for Jack and Jill). It then created a data frame with corresponding
rows and with columns taken from both data frames (kids, states, and ages).
The merge() function has named arguments by.x and by.y, which handle
cases in which variables have similar information but different names in the
two data frames. Here's an example:

> d3
  ages    pals
1   12    Jack
2   10    Jill
3    7 Lillian
> merge(d1,d3,by.x="kids",by.y="pals")
  kids states ages
1 Jack     CA   12
2 Jill     MA   10

Even though our variable was called kids in one data frame and pals in
the other, it was meant to store the same information, and thus the merge
made sense.
Duplicate matches will appear in full in the result, possibly in
undesirable ways.

> d1
     kids states
1    Jack     CA
2    Jill     MA
3 Jillian     MA
4    John     HI
> d2a <- rbind(d2,list(15,"Jill"))
> d2a
  ages    kids
1   12    Jack
2   10    Jill
3    7 Lillian
4   15    Jill
> merge(d1,d2a)
  kids states ages
1 Jack     CA   12
2 Jill     MA   10
3 Jill     MA   15

There are two Jills in d2a. There is a Jill in d1 who lives in Massachusetts
and another Jill with unknown residence. In our previous example,
merge(d1,d2), there was only one Jill, who was presumed to be the same
person in both data frames. But here, in the call merge(d1,d2a), it may have
been the case that only one of the Jills was a Massachusetts resident. It is
clear from this little example that you must choose matching variables with
great care.
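
One simple precaution, sketched here as an illustration rather than a prescribed method, is to look for repeated key values before merging. The data frames are rebuilt by hand from the printed output above, and the names dup_d1 and dup_d2a are invented.

```r
d1 <- data.frame(kids = c("Jack", "Jill", "Jillian", "John"),
                 states = c("CA", "MA", "MA", "HI"),
                 stringsAsFactors = FALSE)
d2a <- data.frame(ages = c(12, 10, 7, 15),
                  kids = c("Jack", "Jill", "Lillian", "Jill"),
                  stringsAsFactors = FALSE)

# Key values that occur more than once in each data frame
dup_d1  <- unique(d1$kids[duplicated(d1$kids)])     # none
dup_d2a <- unique(d2a$kids[duplicated(d2a$kids)])   # "Jill"

merged <- merge(d1, d2a)
nrow(merged)   # 3: Jack once, Jill twice
```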

5.3.1 Extended Example: An Employee Database

The following is an adaptation of one of my consulting projects. At issue
was whether older workers were faring as well as younger ones. I had data
on several variables, such as age and performance ratings, which I used in
my comparison of the older and younger employees. I also had employee
ID numbers, which were crucial in being able to connect the two data files:
DA and DB.
The DA file had this header:

"EmpID", "Perf 1", "Perf 2","Perf 3", "Job Title"

These are names for the employee ID, three performance ratings, and
the job title. DB had no header. The variables again began with the ID,
followed by start and end dates of employment.
Both files were in CSV format. Part of my data-cleaning phase consisted
of checking that each record contained the proper number of fields. DA, for
example, should have five fields per record. Here is the check:

count.fields ("DA", sep=", ")

(1S5 5 5 555555 5555 5 555555 5 55555 555555
5555
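
Scanning a long vector of 5s by eye is error-prone, so a programmatic check is a natural next step. The sketch below writes a small invented CSV to a temporary file, with one deliberately short record, and tests the field counts. The file contents and the name nfields are assumptions made for the illustration.

```r
# Write a small invented CSV to a temporary file; the last record is short
tmp <- tempfile(fileext = ".csv")
writeLines(c('"EmpID","Perf 1","Perf 2","Perf 3","Job Title"',
             '1,3,4,5,"Engineer"',
             '2,4,4,"Analyst"'),
           tmp)

nfields <- count.fields(tmp, sep = ",")
nfields                # 5 5 4
all(nfields == 5)      # FALSE
which(nfields != 5)    # 3: the line number of the short record
unlink(tmp)            # remove the temporary file
```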
