BigData - BCom Unit 4
Data Frames
Imagine a data frame as something akin to a database table or an Excel
spreadsheet. It has a specific number of columns, each of which is
expected to contain values of a particular data type. It also has an
indeterminate number of rows, i.e. sets of related values for each column.
Assume we have been asked to store data about our employees (such as
employee ID, name and the project that they are working on). We have
been given three independent vectors, namely “EmpNo”, “EmpName” and
“ProjName”, that hold the employee IDs, employee names and project
names, respectively.
> EmpNo <- c(1000, 1001, 1002, 1003, 1004)
> EmpName <- c("Jack", "Jane", "Margaritta", "Joe", "Dave")
> ProjName <- c("P01", "P02", "P03", "P04", "P05")
However, we need a data structure similar to a database table or an Excel
spreadsheet that can bind all these details together. We create a data
frame by the name “Employee” to store all three vectors together.
> Employee <- data.frame(EmpNo, EmpName, ProjName)
Let us print the content of the data frame, “Employee”.
> Employee
  EmpNo    EmpName ProjName
1  1000       Jack      P01
2  1001       Jane      P02
3  1002 Margaritta      P03
4  1003        Joe      P04
5  1004       Dave      P05
We have just created a data frame, “Employee” with data neatly
organised into rows and the variable names serving as column names
across the top.
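Rows can also be retrieved by name. Assuming the rows have been named
“Employee 1” to “Employee 5” (by default they are simply numbered 1 to 5),
let us pack the row names in an index vector in order to retrieve multiple
rows.
> Employee[c("Employee 3", "Employee 5"), ]
           EmpNo    EmpName ProjName
Employee 3  1002 Margaritta      P03
Employee 5  1004       Dave      P05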
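The row names used above are not assigned automatically. A minimal sketch, assuming the names are set with row.names() on a copy of the data frame (the copy name EmployeeNamed is hypothetical and not part of the original notes), shows how such a row-name index works:

```r
# Rebuild the example data frame so that this block is self-contained.
Employee <- data.frame(
  EmpNo    = c(1000, 1001, 1002, 1003, 1004),
  EmpName  = c("Jack", "Jane", "Margaritta", "Joe", "Dave"),
  ProjName = c("P01", "P02", "P03", "P04", "P05")
)

# Work on a copy (hypothetical name) so the original keeps row names 1..5.
EmployeeNamed <- Employee
row.names(EmployeeNamed) <- paste("Employee", 1:5)

# Index rows by name; the empty slot after the comma selects all columns.
picked <- EmployeeNamed[c("Employee 3", "Employee 5"), ]
print(picked)
```

Working on a copy keeps the default row numbers on “Employee”, which is what the remaining outputs in these notes show.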
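Retrieving a Column by Providing the Column Name as a String in Double
Brackets
> Employee[["EmpName"]]
[1] Jack Jane Margaritta Joe Dave
Levels: Dave Jack Jane Joe Margaritta
The “Levels” line appears because EmpName is stored as a factor, which is
the default for character vectors in data.frame() before R 4.0.0. In R
4.0.0 and later, the column stays a character vector unless
stringsAsFactors = TRUE is passed to data.frame().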
To keep it simple (typing so many double brackets can get unwieldy at
times), use the notation with the $ (dollar) sign.
> Employee$EmpName
[1] Jack Jane Margaritta Joe Dave
Levels: Dave Jack Jane Joe Margaritta
To retrieve a data frame slice with the two columns, “EmpNo” and
“ProjName”, we pack the column names in an index vector inside the single
square bracket operator.
> Employee[c("EmpNo", "ProjName")]
  EmpNo ProjName
1  1000      P01
2  1001      P02
3  1002      P03
4  1003      P04
5  1004      P05
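The three column-access notations differ in what they return. The following sketch (variable names such as name_vec and name_df are illustrative only) suggests that double brackets and $ give back the column itself, while single brackets give back a smaller data frame:

```r
# Rebuild the example data frame so that this block is self-contained.
Employee <- data.frame(
  EmpNo    = c(1000, 1001, 1002, 1003, 1004),
  EmpName  = c("Jack", "Jane", "Margaritta", "Joe", "Dave"),
  ProjName = c("P01", "P02", "P03", "P04", "P05")
)

name_vec  <- Employee[["EmpName"]]  # the column itself (a vector or factor)
name_vec2 <- Employee$EmpName       # same result, shorter notation
name_df   <- Employee["EmpName"]    # a data frame with one column
slice     <- Employee[c("EmpNo", "ProjName")]  # a data frame with two columns

print(identical(name_vec, name_vec2))
print(is.data.frame(name_df))
print(dim(slice))
```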
Let us add a new column to the data frame.
To add a new column, “EmpExpYears”, that stores the total number of years
of experience that the employee has in the organisation, assign a vector
to the new column name:
> Employee$EmpExpYears <- c(5, 9, 6, 12, 7)
Print the contents of the data frame, “Employee”, to verify the addition
of the new column.
> Employee
  EmpNo    EmpName ProjName EmpExpYears
1  1000       Jack      P01           5
2  1001       Jane      P02           9
3  1002 Margaritta      P03           6
4  1003        Joe      P04          12
5  1004       Dave      P05           7
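A new column does not have to be typed in by hand; it can also be computed from an existing one. The sketch below works on a copy and adds a column named IsSenior with a cut-off of 7 years. Both the column name and the cut-off are hypothetical, chosen only for illustration, and are not used elsewhere in these notes.

```r
# Rebuild the example data frame so that this block is self-contained.
Employee <- data.frame(
  EmpNo    = c(1000, 1001, 1002, 1003, 1004),
  EmpName  = c("Jack", "Jane", "Margaritta", "Joe", "Dave"),
  ProjName = c("P01", "P02", "P03", "P04", "P05")
)
Employee$EmpExpYears <- c(5, 9, 6, 12, 7)

# Work on a copy so that "Employee" keeps its 4 columns.
EmployeeExtra <- Employee

# Hypothetical derived column: TRUE for 7 or more years of experience.
EmployeeExtra$IsSenior <- EmployeeExtra$EmpExpYears >= 7

print(names(EmployeeExtra))
print(EmployeeExtra$IsSenior)
```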
dim() Function
The dim() function is used to obtain the dimensions of a data frame. It
returns the number of rows followed by the number of columns.
> dim(Employee)
[1] 5 4
The data frame, “Employee” has 5 rows and 4 columns.
nrow() Function
The nrow() function returns the number of rows in a data frame.
> nrow(Employee)
[1] 5
The data frame, “Employee” has 5 rows.
ncol() Function
The ncol() function returns the number of columns in a data frame.
> ncol(Employee)
[1] 4
The data frame, “Employee” has 4 columns.
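The three functions above are closely related. As a small check (the variable name dims is illustrative), the first element of dim() matches nrow() and the second matches ncol():

```r
# Rebuild the four-column example data frame so the block is self-contained.
Employee <- data.frame(
  EmpNo       = c(1000, 1001, 1002, 1003, 1004),
  EmpName     = c("Jack", "Jane", "Margaritta", "Joe", "Dave"),
  ProjName    = c("P01", "P02", "P03", "P04", "P05"),
  EmpExpYears = c(5, 9, 6, 12, 7)
)

dims <- dim(Employee)             # c(rows, columns)
print(dims[1] == nrow(Employee))  # first element is the row count
print(dims[2] == ncol(Employee))  # second element is the column count
```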
str() Function
The str() function compactly displays the internal structure of R objects.
We will use it to display the internal structure of the dataset,
“Employee”.
> str(Employee)
'data.frame': 5 obs. of 4 variables:
 $ EmpNo      : num 1000 1001 1002 1003 1004
 $ EmpName    : Factor w/ 5 levels "Dave","Jack",..: 2 3 5 4 1
 $ ProjName   : Factor w/ 5 levels "P01","P02","P03",..: 1 2 3 4 5
 $ EmpExpYears: num 5 9 6 12 7
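The numbers 2 3 5 4 1 shown for EmpName are the positions of each name in the sorted list of levels. The sketch below passes stringsAsFactors = TRUE explicitly, on the assumption that a recent R version is in use, where factors are no longer the default:

```r
# stringsAsFactors = TRUE reproduces the factor columns shown by str().
Employee <- data.frame(
  EmpNo    = c(1000, 1001, 1002, 1003, 1004),
  EmpName  = c("Jack", "Jane", "Margaritta", "Joe", "Dave"),
  ProjName = c("P01", "P02", "P03", "P04", "P05"),
  stringsAsFactors = TRUE
)

print(class(Employee$EmpName))       # the column is a factor
print(levels(Employee$EmpName))      # distinct names, sorted alphabetically
print(as.integer(Employee$EmpName))  # position of each name among the levels
```

“Jack” is the second level, so the first code is 2; “Dave” is the first level, so the last code is 1.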
summary() Function
We will use the summary() function to return result summaries for each
column of the dataset.
> summary(Employee)
     EmpNo           EmpName   ProjName  EmpExpYears
 Min.   :1000   Dave      :1   P01:1    Min.   : 5.0
 1st Qu.:1001   Jack      :1   P02:1    1st Qu.: 6.0
 Median :1002   Jane      :1   P03:1    Median : 7.0
 Mean   :1002   Joe       :1   P04:1    Mean   : 7.8
 3rd Qu.:1003   Margaritta:1   P05:1    3rd Qu.: 9.0
 Max.   :1004                           Max.   :12.0
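For a numeric column, the figures reported by summary() can be reproduced with the individual base R functions. A brief sketch for the EmpExpYears values (the vector name stats and its labels are illustrative):

```r
EmpExpYears <- c(5, 9, 6, 12, 7)

# Recompute each statistic separately; unname() drops the "25%" style labels.
stats <- c(
  Min    = min(EmpExpYears),
  Q1     = unname(quantile(EmpExpYears, 0.25)),
  Median = median(EmpExpYears),
  Mean   = mean(EmpExpYears),
  Q3     = unname(quantile(EmpExpYears, 0.75)),
  Max    = max(EmpExpYears)
)
print(stats)
```

For factor columns such as EmpName, summary() reports a count per level instead, which is why each name appears with the value 1.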
names() Function
The names() function returns the names of an object. We will use the
names() function to return the column headers for the dataset,
“Employee”.
> names(Employee)
[1] "EmpNo" "EmpName" "ProjName" "EmpExpYears"
In the example, names(Employee) returns the column headers of the
dataset “Employee”.
For an overall view of the dataset, including the type of each column,
use the str() function described earlier.
head() Function
The head() function is used to obtain the first n observations, where n is
set to 6 by default.
Example 1: In this example, the value of n is set to 3 and hence the
resulting output contains the first 3 observations of the dataset.
> head(Employee, n=3)
  EmpNo    EmpName ProjName EmpExpYears
1  1000       Jack      P01           5
2  1001       Jane      P02           9
3  1002 Margaritta      P03           6
Example 2: Consider x as the total number of observations. If a negative
value is given for n in the head() function, the output is the first x + n
observations, i.e. all rows except the last |n|. In this example, x = 5
and n = -2, so the number of observations returned is
x + n = 5 + (-2) = 3
> head(Employee, n=-2)
  EmpNo    EmpName ProjName EmpExpYears
1  1000       Jack      P01           5
2  1001       Jane      P02           9
3  1002 Margaritta      P03           6
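The x + n rule for negative n can be checked directly. In this sketch (the names x, n and first_rows are illustrative), head() with n = -2 returns the same rows as head() with n = 3 on the five-row data frame:

```r
# Rebuild the four-column example data frame so the block is self-contained.
Employee <- data.frame(
  EmpNo       = c(1000, 1001, 1002, 1003, 1004),
  EmpName     = c("Jack", "Jane", "Margaritta", "Joe", "Dave"),
  ProjName    = c("P01", "P02", "P03", "P04", "P05"),
  EmpExpYears = c(5, 9, 6, 12, 7)
)

x <- nrow(Employee)  # total number of observations
n <- -2              # negative n drops rows from the end

first_rows <- head(Employee, n = n)
print(nrow(first_rows) == x + n)
print(first_rows)
```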
tail() Function
The tail() function is used to obtain the last n observations, where n is
set to 6 by default.
> tail(Employee, n=3)
  EmpNo    EmpName ProjName EmpExpYears
3  1002 Margaritta      P03           6
4  1003        Joe      P04          12
5  1004       Dave      P05           7
Example: When a negative value is given for n in the tail() function, it
returns the last x + n observations, i.e. all rows except the first |n|.
Here x = 5 and n = -2, so the example given as follows returns the last 3
records from the dataset, “Employee”.
> tail(Employee, n=-2)
  EmpNo    EmpName ProjName EmpExpYears
3  1002 Margaritta      P03           6
4  1003        Joe      P04          12
5  1004       Dave      P05           7
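With tail(), the rows keep their original row numbers, which makes it easy to see which rows were dropped. A short sketch (the name last_rows is illustrative):

```r
# Rebuild the four-column example data frame so the block is self-contained.
Employee <- data.frame(
  EmpNo       = c(1000, 1001, 1002, 1003, 1004),
  EmpName     = c("Jack", "Jane", "Margaritta", "Joe", "Dave"),
  ProjName    = c("P01", "P02", "P03", "P04", "P05"),
  EmpExpYears = c(5, 9, 6, 12, 7)
)

# Negative n drops the first two rows and keeps the rest.
last_rows <- tail(Employee, n = -2)
print(row.names(last_rows))  # original row numbers of the rows that remain
print(last_rows$EmpNo)
```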
edit() Function
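The edit() function invokes an editor on the R object; for a data frame,
this is a spreadsheet-like data editor. We will use the edit() function to
open the dataset, “Employee”, in the editor. Note that edit() returns the
edited copy and leaves “Employee” itself unchanged unless the result is
assigned back to it.
> edit(Employee)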
To retrieve the first three rows (with all columns) from the dataset,
“Employee”, use the syntax given as follows:
> Employee[1:3, ]
  EmpNo    EmpName ProjName EmpExpYears
1  1000       Jack      P01           5
2  1001       Jane      P02           9
3  1002 Margaritta      P03           6
To retrieve the first three rows (with the first two columns) from the
dataset, “Employee”, use the syntax given as follows:
> Employee[1:3, 1:2]
  EmpNo    EmpName
1  1000       Jack
2  1001       Jane
3  1002 Margaritta
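Row positions and column names can be combined in the same index. The sketch below (variable names by_position and by_name are illustrative) suggests that selecting the first two columns by number or by name gives the same slice:

```r
# Rebuild the four-column example data frame so the block is self-contained.
Employee <- data.frame(
  EmpNo       = c(1000, 1001, 1002, 1003, 1004),
  EmpName     = c("Jack", "Jane", "Margaritta", "Joe", "Dave"),
  ProjName    = c("P01", "P02", "P03", "P04", "P05"),
  EmpExpYears = c(5, 9, 6, 12, 7)
)

# Rows go before the comma, columns after it.
by_position <- Employee[1:3, 1:2]
by_name     <- Employee[1:3, c("EmpNo", "EmpName")]

print(identical(by_position, by_name))
print(by_name)
```

Selecting by name tends to be more robust, since it keeps working if columns are later added or reordered.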
Notice the use of V1, V2 and V3 as column headings. It means that our
specified column names, “Itemcode”, “ItemCategory” and “ItemPrice”, are
not considered. In other words, the first line is not automatically
treated as a column header.
Let us modify the syntax, so that the first line is treated as a column
header.
> read.table("d:/item-tab-sep.txt", sep="\t", header=TRUE)
  Itemcode ItemQtyOnHand ItemReorderLvl
1    I1001            75             25
2    I1002            30             25
3    I1003            35             25
Now let us read the content of the specified file into the data frame,
“ItemDataFrame”.
> ItemDataFrame <- read.table("d:/item-tab-sep.txt", sep="\t", header=TRUE)
> ItemDataFrame
  Itemcode ItemQtyOnHand ItemReorderLvl
1    I1001            75             25
2    I1002            30             25
3    I1003            35             25
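The effect of the header argument can be tried without a file on disk by using the text argument of read.table(). In the sketch below, the string item_text is an assumed stand-in for the contents of the tab-separated file, reconstructed from the output shown above:

```r
# Assumed file contents: one header line and three tab-separated records.
item_text <- paste(
  "Itemcode\tItemQtyOnHand\tItemReorderLvl",
  "I1001\t75\t25",
  "I1002\t30\t25",
  "I1003\t35\t25",
  sep = "\n"
)

# Without header = TRUE the first line is read as an ordinary data row.
no_header <- read.table(text = item_text, sep = "\t")

# With header = TRUE the first line supplies the column names.
with_header <- read.table(text = item_text, sep = "\t", header = TRUE)

print(names(no_header))
print(names(with_header))
print(nrow(no_header) - nrow(with_header))  # the header line no longer counts as data
```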
Reading from a Table
A data table can reside in a text file. The cells inside the table are
separated by blank characters. An example of a table with 4 rows and 3
columns is given as follows:
1001 Physics 85
2001 Chemistry 87
3001 Mathematics 93
4001 English 84
Copy and paste the table in a file named “d:/mydata.txt” with a text
editor and then load the data into the workspace with the function
“read.table”.
> mydata <- read.table("d:/mydata.txt")
> mydata
    V1          V2 V3
1 1001     Physics 85
2 2001   Chemistry 87
3 3001 Mathematics 93
4 4001     English 84
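Since this table has no header line, read.table() falls back to the names V1, V2 and V3. As a sketch, more meaningful names can be supplied through the col.names argument; the names Code, Subject and Marks below are hypothetical, and the text argument is used in place of the file so that the example is self-contained:

```r
# The same four records, separated by blank characters.
table_text <- paste(
  "1001 Physics 85",
  "2001 Chemistry 87",
  "3001 Mathematics 93",
  "4001 English 84",
  sep = "\n"
)

# Default behaviour: whitespace-separated fields, columns named V1, V2, V3.
mydata <- read.table(text = table_text)

# Hypothetical column names supplied at read time.
named <- read.table(text = table_text,
                    col.names = c("Code", "Subject", "Marks"))

print(names(mydata))
print(named)
```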
Merging Data Frames
Let us now attempt to merge two data frames using the merge() function.
The merge() function takes an x frame (read from item.csv) and a y frame
(read from item-tab-sep.txt) as arguments. By default, it joins the two
frames on columns with the same name (the two “Itemcode” columns).
> csvitem <- read.csv("d:/item.csv")
> tabitem <- read.table("d:/item-tab-sep.txt", sep="\t", header=TRUE)
> merge(x=csvitem, y=tabitem)
  Itemcode     ItemCategory ItemPrice ItemQtyOnHand ItemReorderLvl
1    I1001      Electronics       700            75             25
2    I1002 Desktop supplies       300            30             25
3    I1003  Office supplies       350            35             25
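The join column can also be named explicitly with the by argument, which makes the intent clearer when the frames share more than one column name. In the sketch below, the two data frames are built in memory as assumed stand-ins for the contents of the two files, based on the merged output above:

```r
# Assumed contents of item.csv.
csvitem <- data.frame(
  Itemcode     = c("I1001", "I1002", "I1003"),
  ItemCategory = c("Electronics", "Desktop supplies", "Office supplies"),
  ItemPrice    = c(700, 300, 350)
)

# Assumed contents of item-tab-sep.txt.
tabitem <- data.frame(
  Itemcode       = c("I1001", "I1002", "I1003"),
  ItemQtyOnHand  = c(75, 30, 35),
  ItemReorderLvl = c(25, 25, 25)
)

# Join on the shared key column, stated explicitly.
merged <- merge(x = csvitem, y = tabitem, by = "Itemcode")
print(merged)
```

The key column comes first in the result, followed by the remaining columns of x and then those of y.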