
BigData - BCom Unit 4

This document provides an overview of data frames in R, including how to create, access, and manipulate them. It covers the creation of a data frame for employee data, methods for accessing data using indices and column names, and functions for exploring data such as dim(), nrow(), and summary(). Additionally, it explains how to load data from external files and subset data frames based on specific conditions.

Unit-IV: EXPLORING DATA IN R

Data Frames
Think of a data frame as something akin to a database table or an Excel
spreadsheet. It has a fixed number of columns, each of which is expected
to contain values of a particular data type, and any number of rows, i.e.
sets of related values, one for each column.
Assume we have been asked to store data about our employees (employee
ID, name and the project that they are working on). We have been given
three independent vectors, namely “EmpNo”, “EmpName” and “ProjName”,
that hold the employee IDs, employee names and project names,
respectively.
> EmpNo <- c(1000, 1001, 1002, 1003, 1004)
> EmpName <- c("Jack", "Jane", "Margaritta", "Joe", "Dave")
> ProjName <- c("P01", "P02", "P03", "P04", "P05")
However, we need a data structure similar to a database table or an Excel
spreadsheet that can bind all these details together. We create a data
frame by the name, “Employee” to store all three vectors together.
> Employee <- data.frame(EmpNo, EmpName, ProjName)
Let us print the content of the data frame, “Employee”.
> Employee
  EmpNo    EmpName ProjName
1  1000       Jack      P01
2  1001       Jane      P02
3  1002 Margaritta      P03
4  1003        Joe      P04
5  1004       Dave      P05
We have just created a data frame, “Employee”, with the data neatly
organised into rows and the variable names serving as column names
across the top.

Data Frame Access


There are two ways to access the content of data frames:
i. By providing the index number in square brackets
ii. By providing the column name as a string in double brackets.
By Providing the Index Number in Square Brackets
Example 1: To access the second column, “EmpName”, we type the
following command at the R prompt.
> Employee[2]
     EmpName
1       Jack
2       Jane
3 Margaritta
4        Joe
5       Dave
Example 2: To access the first and the second columns, “EmpNo” and
“EmpName”, we type the following command at the R prompt.
> Employee[1:2]
  EmpNo    EmpName
1  1000       Jack
2  1001       Jane
3  1002 Margaritta
4  1003        Joe
5  1004       Dave
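Example 3: To access the third row, we type the following command at the
R prompt.
> Employee[3, ]
  EmpNo    EmpName ProjName
3  1002 Margaritta      P03
Please notice the extra comma in the square bracket operator in the
example. It is not a typo: the index before the comma selects rows, the
index after it selects columns, and leaving the column index empty keeps
all the columns.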
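The difference between these indexing forms can be checked directly. The following is a minimal sketch in base R, assuming the same three vectors as above; the variable names `cols`, `row3` and `cell` are introduced here only for illustration.

```r
# Rebuild the Employee data frame used in the text
Employee <- data.frame(
  EmpNo    = c(1000, 1001, 1002, 1003, 1004),
  EmpName  = c("Jack", "Jane", "Margaritta", "Joe", "Dave"),
  ProjName = c("P01", "P02", "P03", "P04", "P05")
)

cols <- Employee[2]      # one index, no comma: selects a column, result is a data frame
row3 <- Employee[3, ]    # index before the comma: selects a row, all columns are kept
cell <- Employee[3, 2]   # row index and column index: a single value

class(cols)              # "data.frame"
dim(row3)                # one row, three columns
as.character(cell)       # "Margaritta"
```

The single-index form always returns a data frame, even for one column, which is why the output of Example 1 still carries a column header.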
Example 4: Let us define row names for the rows in the data frame.
> row.names(Employee) <- c("Employee 1", "Employee 2", "Employee 3", "Employee 4", "Employee 5")
> row.names(Employee)
[1] "Employee 1" "Employee 2" "Employee 3" "Employee 4" "Employee 5"
> Employee
           EmpNo    EmpName ProjName
Employee 1  1000       Jack      P01
Employee 2  1001       Jane      P02
Employee 3  1002 Margaritta      P03
Employee 4  1003        Joe      P04
Employee 5  1004       Dave      P05
Let us retrieve a row by its name.
> Employee["Employee 1", ]
           EmpNo EmpName ProjName
Employee 1  1000    Jack      P01

Let us pack the row names in an index vector in order to retrieve multiple
rows.
> Employee[c("Employee 3", "Employee 5"), ]
           EmpNo    EmpName ProjName
Employee 3  1002 Margaritta      P03
Employee 5  1004       Dave      P05
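By Providing the Column Name as a String in Double Brackets
> Employee[["EmpName"]]
[1] Jack       Jane       Margaritta Joe        Dave
Levels: Dave Jack Jane Joe Margaritta
The “Levels:” line appears because the names are stored as a factor,
which is what data.frame() does with strings by default in R versions
before 4.0.0. From R 4.0.0 onwards, strings stay character vectors unless
stringsAsFactors = TRUE is passed, and the same call prints a quoted
character vector instead.
To keep it simple (typing so many double brackets can get unwieldy at
times), use the notation with the $ (dollar) sign.
> Employee$EmpName
[1] Jack       Jane       Margaritta Joe        Dave
Levels: Dave Jack Jane Joe Margaritta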
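Whether a “Levels:” line is printed depends on how the strings were stored when the data frame was built. The sketch below makes both cases explicit with the `stringsAsFactors` argument of `data.frame()`; the names `as_chr` and `as_fct` are illustrative only.

```r
EmpName <- c("Jack", "Jane", "Margaritta", "Joe", "Dave")

# Strings kept as character vectors (the default from R 4.0.0 onwards)
as_chr <- data.frame(EmpName, stringsAsFactors = FALSE)

# Strings converted to factors (the default before R 4.0.0)
as_fct <- data.frame(EmpName, stringsAsFactors = TRUE)

class(as_chr$EmpName)     # "character"
class(as_fct$EmpName)     # "factor"
levels(as_fct$EmpName)    # the distinct names, sorted alphabetically

# The double-bracket form and the $ form return the same column vector
identical(as_fct[["EmpName"]], as_fct$EmpName)
```

Passing the argument explicitly makes a script behave the same way on old and new R installations.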
To retrieve a data frame slice with the two columns, “EmpNo” and
“ProjName”, we pack the column names in an index vector inside the single
square bracket operator.
> Employee[c("EmpNo", "ProjName")]
  EmpNo ProjName
1  1000      P01
2  1001      P02
3  1002      P03
4  1003      P04
5  1004      P05
Let us add a new column to the data frame.
To add a new column, “EmpExpYears” to store the total number of years
of experience that the employee has in the organisation, follow the steps
given as follows:
> Employee$EmpExpYears <-c(5, 9, 6, 12, 7)
Print the contents of the date frame, “Employee” to verify the addition of
the new
column.
> Employee
EmpNo EmpName ProjName EmpExpYears
1 1000 Jack P01 5
2 1001 Jane P02 9
3 1002 Margaritta P03 6
4 1003 Joe P04 12
5 1004 Dave P05 7

Ordering the Data Frames


Let us display the content of the data frame, “Employee” in ascending
order of “EmpExpYears”.
> Employee[order(Employee$EmpExpYears), ]
  EmpNo    EmpName ProjName EmpExpYears
1  1000       Jack      P01           5
3  1002 Margaritta      P03           6
5  1004       Dave      P05           7
2  1001       Jane      P02           9
4  1003        Joe      P04          12
Use the syntax shown next to display the content of the data frame,
“Employee” in descending order of “EmpExpYears”.
> Employee[order(-Employee$EmpExpYears), ]
  EmpNo    EmpName ProjName EmpExpYears
4  1003        Joe      P04          12
2  1001       Jane      P02           9
5  1004       Dave      P05           7
3  1002 Margaritta      P03           6
1  1000       Jack      P01           5
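It may help to see what order() itself returns before it is used as a row index. The following is a minimal sketch on a reduced copy of the data frame; the `decreasing = TRUE` variant is shown as an alternative to the unary minus, which only applies to numeric columns. The names `idx`, `ascending` and `by_name_desc` are illustrative.

```r
Employee <- data.frame(
  EmpNo       = c(1000, 1001, 1002, 1003, 1004),
  EmpName     = c("Jack", "Jane", "Margaritta", "Joe", "Dave"),
  EmpExpYears = c(5, 9, 6, 12, 7)
)

# order() returns the row positions that would sort the column
idx <- order(Employee$EmpExpYears)    # 1 3 5 2 4

# Indexing the rows with those positions gives the sorted data frame
ascending <- Employee[idx, ]

# decreasing = TRUE sorts in descending order and also works for names
by_name_desc <- Employee[order(Employee$EmpName, decreasing = TRUE), ]
by_name_desc$EmpNo                    # 1002 1003 1001 1000 1004
```

Note that the original data frame is left unchanged; the sorted result has to be assigned to a name if it is needed later.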
R Functions for understanding Data in Data Frames
We will explore the data held in the data frame with the help of the
following R functions:
• dim()
• nrow()
• ncol()
• str()
• summary()
• names()
• head()
• tail()
• edit()

dim() Function
The dim() function is used to obtain the dimensions of a data frame. It
returns the number of rows followed by the number of columns.
> dim(Employee)
[1] 5 4
The data frame, “Employee” has 5 rows and 4 columns.
nrow() Function
The nrow() function returns the number of rows in a data frame.
> nrow(Employee)
[1] 5
The data frame, “Employee” has 5 rows.
ncol() Function
The ncol() function returns the number of columns in a data frame.
> ncol(Employee)
[1] 4
The data frame, “Employee” has 4 columns.
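The three functions are related: dim() returns the values of nrow() and ncol() as a single vector. A minimal sketch, assuming the four-column “Employee” data frame from the text:

```r
Employee <- data.frame(
  EmpNo       = c(1000, 1001, 1002, 1003, 1004),
  EmpName     = c("Jack", "Jane", "Margaritta", "Joe", "Dave"),
  ProjName    = c("P01", "P02", "P03", "P04", "P05"),
  EmpExpYears = c(5, 9, 6, 12, 7)
)

d <- dim(Employee)        # rows first, then columns

d[1] == nrow(Employee)    # TRUE
d[2] == ncol(Employee)    # TRUE
length(Employee)          # for a data frame, length() is the number of columns
```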
str() Function
The str() function compactly displays the internal structure of R objects.
We will use it to display the internal structure of the dataset,
“Employee”.
> str(Employee)
'data.frame':   5 obs. of  4 variables:
 $ EmpNo      : num  1000 1001 1002 1003 1004
 $ EmpName    : Factor w/ 5 levels "Dave","Jack",..: 2 3 5 4 1
 $ ProjName   : Factor w/ 5 levels "P01","P02","P03",..: 1 2 3 4 5
 $ EmpExpYears: num  5 9 6 12 7
summary() Function
We will use the summary() function to return result summaries for each
column of the dataset.
> summary(Employee)
     EmpNo            EmpName   ProjName  EmpExpYears
 Min.   :1000   Dave      :1    P01:1     Min.   : 5.0
 1st Qu.:1001   Jack      :1    P02:1     1st Qu.: 6.0
 Median :1002   Jane      :1    P03:1     Median : 7.0
 Mean   :1002   Joe       :1    P04:1     Mean   : 7.8
 3rd Qu.:1003   Margaritta:1    P05:1     3rd Qu.: 9.0
 Max.   :1004                             Max.   :12.0
names() Function
The names() function returns the names of an object. For a data frame,
these are the column headers. We will use it to return the column headers
of the dataset, “Employee”.
> names(Employee)
[1] "EmpNo"       "EmpName"     "ProjName"    "EmpExpYears"
Whereas names() returns only the column headers, the str() function gives
an overall view of the dataset: the type and the first few values of each
column.
head() Function
The head() function is used to obtain the first n observations, where n is
6 by default.
Example 1: Here the value of n is set to 3, so the output contains the
first 3 observations of the dataset.
> head(Employee, n=3)
  EmpNo    EmpName ProjName EmpExpYears
1  1000       Jack      P01           5
2  1001       Jane      P02           9
3  1002 Margaritta      P03           6
Example 2: Let x be the total number of observations. If n is negative,
head() returns all but the last |n| observations, i.e. the first x + n
observations. Here x = 5 and n = -2, so the number of observations
returned is
x + n = 5 + (-2) = 3
> head(Employee, n=-2)
  EmpNo    EmpName ProjName EmpExpYears
1  1000       Jack      P01           5
2  1001       Jane      P02           9
3  1002 Margaritta      P03           6
tail() Function
The tail() function is used to obtain the last n observations, where n is
6 by default.
> tail(Employee, n=3)
  EmpNo    EmpName ProjName EmpExpYears
3  1002 Margaritta      P03           6
4  1003        Joe      P04          12
5  1004       Dave      P05           7
Example: If n is negative, tail() returns all but the first |n|
observations, i.e. the last x + n observations, where x is the total
number of observations. Here x = 5 and n = -2, so the following call
returns the last 3 records of the dataset, “Employee”.
> tail(Employee, n=-2)
  EmpNo    EmpName ProjName EmpExpYears
3  1002 Margaritta      P03           6
4  1003        Joe      P04          12
5  1004       Dave      P05           7
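The x + n rule for negative values of n can be verified on a small example. This is a sketch on a simplified two-column version of the data frame (integer employee numbers are an assumption made here for brevity); `first_part` and `last_part` are illustrative names.

```r
Employee <- data.frame(
  EmpNo       = 1000:1004,
  EmpExpYears = c(5, 9, 6, 12, 7)
)

x <- nrow(Employee)   # total number of observations: 5
n <- -2

first_part <- head(Employee, n = n)   # drops the last 2 rows
last_part  <- tail(Employee, n = n)   # drops the first 2 rows

nrow(first_part)      # x + n = 3
first_part$EmpNo      # 1000 1001 1002
last_part$EmpNo       # 1002 1003 1004
```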
edit() Function
The edit() function invokes an editor on the R object; for a data frame,
this is the data editor. We will use the edit() function to open the
dataset, “Employee” in the editor. Note that edit() returns the edited
copy and leaves the original object unchanged, so assign the result to a
name if the changes are to be kept.
> edit(Employee)
To retrieve the first three rows (with all columns) from the dataset,
“Employee”, use the syntax given as follows:
> Employee[1:3, ]
  EmpNo    EmpName ProjName EmpExpYears
1  1000       Jack      P01           5
2  1001       Jane      P02           9
3  1002 Margaritta      P03           6
To retrieve the first three rows (with the first two columns) from the
dataset, “Employee”, use the syntax given as follows:
> Employee[1:3, 1:2]
  EmpNo    EmpName
1  1000       Jack
2  1001       Jane
3  1002 Margaritta

A brief summary of functions for exploring data in R


Function Name          Description
nrow(x)                Returns the number of rows
ncol(x)                Returns the number of columns
str(mydata)            Displays the structure of a dataset
summary(mydata)        Provides basic descriptive statistics and frequencies
edit(mydata)           Opens the data editor
names(mydata)          Returns the list of variables in a dataset
head(mydata)           Returns the first n rows of a dataset. By default, n = 6
head(mydata, n=10)     Returns the first 10 rows of a dataset
head(mydata, n=-10)    Returns all the rows but the last 10
tail(mydata)           Returns the last n rows. By default, n = 6
tail(mydata, n=10)     Returns the last 10 rows
tail(mydata, n=-10)    Returns all the rows but the first 10
mydata[1:10, ]         Returns the first 10 rows
mydata[1:10, 1:3]      Returns the first 10 rows of data of the first 3 variables

Load Data Frames


Let us look at how R can load data into data frames from external files.
Reading from a .csv (comma separated values file)
We have created a .csv file by the name, “item.csv” in the D:\ drive. Viewed
in a spreadsheet, it has the following content:
    A          B                  C
1   Itemcode   ItemCategory       ItemPrice
2   I1001      Electronics        700
3   I1002      Desktop supplies   300
4   I1003      Office supplies    350

Let us load this file using the read.csv function.


> ItemDataFrame <- read.csv("D:/item.csv")
> ItemDataFrame
  Itemcode     ItemCategory ItemPrice
1    I1001      Electronics       700
2    I1002 Desktop supplies       300
3    I1003  Office supplies       350
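To try read.csv() without creating a file on the D:\ drive, the same content can be written to a temporary file first. The sketch below assumes the file is plain comma-separated text with a header line; `csv_path` is a hypothetical stand-in for “D:/item.csv”.

```r
# Write the sample content to a temporary file (stand-in for D:/item.csv)
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Itemcode,ItemCategory,ItemPrice",
  "I1001,Electronics,700",
  "I1002,Desktop supplies,300",
  "I1003,Office supplies,350"
), csv_path)

# read.csv() treats the first line as the column header by default
ItemDataFrame <- read.csv(csv_path)

names(ItemDataFrame)       # "Itemcode" "ItemCategory" "ItemPrice"
ItemDataFrame$ItemPrice    # 700 300 350, read as numbers

unlink(csv_path)           # remove the temporary file
```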
Subsetting Data Frame
The subset() function returns the rows of a data frame that satisfy a
condition. To display the details of only those items whose price is
greater than or equal to 350:
> subset(ItemDataFrame, ItemPrice >= 350)
  Itemcode    ItemCategory ItemPrice
1    I1001     Electronics       700
3    I1003 Office supplies       350
To display only the category to which the items belong (for items whose
price is greater than or equal to 350), add the select argument:
> subset(ItemDataFrame, ItemPrice >= 350, select = c(ItemCategory))
     ItemCategory
1     Electronics
3 Office supplies
To display only the items where the category is either “Office supplies”
or “Desktop supplies”:
> subset(ItemDataFrame, ItemCategory == "Office supplies" | ItemCategory == "Desktop supplies")
  Itemcode     ItemCategory ItemPrice
2    I1002 Desktop supplies       300
3    I1003  Office supplies       350
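The same selections can also be written with the square bracket operator introduced earlier. The following sketch builds the item data in memory (rather than reading “item.csv”) and compares the two styles; `via_subset`, `via_index` and `supplies` are illustrative names, and `%in%` is used as a shorter alternative to chaining `==` with `|`.

```r
ItemDataFrame <- data.frame(
  Itemcode     = c("I1001", "I1002", "I1003"),
  ItemCategory = c("Electronics", "Desktop supplies", "Office supplies"),
  ItemPrice    = c(700, 300, 350)
)

# subset(): the condition and the columns to keep are given by name
via_subset <- subset(ItemDataFrame, ItemPrice >= 350, select = c(ItemCategory))

# Bracket form: logical row index, column name, drop = FALSE keeps a data frame
via_index <- ItemDataFrame[ItemDataFrame$ItemPrice >= 350, "ItemCategory", drop = FALSE]

# %in% tests membership in a set of values
supplies <- subset(ItemDataFrame,
                   ItemCategory %in% c("Office supplies", "Desktop supplies"))
supplies$Itemcode    # I1002 and I1003
```

subset() is convenient at the prompt because column names can be used without the `ItemDataFrame$` prefix; the bracket form is more explicit.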
Reading from a Tab Separated Value File
For any file that uses a delimiter other than a comma, one can use the
“read.table” function.
Example: We have created a tab separated file by the name,
“item-tab-sep.txt” in the D:\ drive. It has the following content.
Itemcode   ItemQtyOnHand   ItemReorderLvl
I1001      75              25
I1002      30              25
I1003      35              25
Let us load this file using the “read.table” function. We will read the
content from the file but will not store its content in a data frame.
> read.table("d:/item-tab-sep.txt", sep="\t")
        V1            V2             V3
1 Itemcode ItemQtyOnHand ItemReorderLvl
2    I1001            75             25
3    I1002            30             25
4    I1003            35             25

Notice the use of V1, V2 and V3 as column headings. It means that our
specified column names, “Itemcode”, “ItemQtyOnHand” and
“ItemReorderLvl”, are not considered. In other words, the first line is
not automatically treated as a column header; it is read as a row of data.
Let us modify the syntax, so that the first line is treated as a column
header.
> read.table("d:/item-tab-sep.txt", sep="\t", header=TRUE)
  Itemcode ItemQtyOnHand ItemReorderLvl
1    I1001            75             25
2    I1002            30             25
3    I1003            35             25
Now let us read the content of the specified file into the data frame,
“ItemDataFrame”.
> ItemDataFrame <- read.table("D:/item-tab-sep.txt", sep="\t", header=TRUE)
> ItemDataFrame
  Itemcode ItemQtyOnHand ItemReorderLvl
1    I1001            75             25
2    I1002            30             25
3    I1003            35             25
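The effect of the header argument can be reproduced with a temporary file. This is a sketch under the assumption that the file contains the three columns shown above separated by tab characters; `tsv_path` is a hypothetical stand-in for “d:/item-tab-sep.txt”, and read.delim() is shown as a base R shortcut whose defaults are a tab separator and header = TRUE.

```r
# Write the sample content to a temporary tab separated file
tsv_path <- tempfile(fileext = ".txt")
writeLines(c(
  "Itemcode\tItemQtyOnHand\tItemReorderLvl",
  "I1001\t75\t25",
  "I1002\t30\t25",
  "I1003\t35\t25"
), tsv_path)

no_header   <- read.table(tsv_path, sep = "\t")                  # first line read as data
with_header <- read.table(tsv_path, sep = "\t", header = TRUE)   # first line read as names
via_delim   <- read.delim(tsv_path)                              # tab separator and header by default

names(no_header)      # "V1" "V2" "V3"
nrow(no_header)       # 4: the header line is counted as a row
names(with_header)    # "Itemcode" "ItemQtyOnHand" "ItemReorderLvl"
nrow(with_header)     # 3

unlink(tsv_path)      # remove the temporary file
```

Without the header, every column contains at least one text value, so the quantities are not read as numbers; with header = TRUE they are.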
Reading from a Table
A data table can reside in a text file. The cells inside the table are
separated by blank characters. An example of a table with 4 rows and 3
columns is given as follows:
1001 Physics 85
2001 Chemistry 87
3001 Mathematics 93
4001 English 84
Copy and paste the table in a file named “d:/mydata.txt” with a text
editor and then load the data into the workspace with the function
“read.table”.
> mydata <- read.table("d:/mydata.txt")
> mydata
    V1          V2 V3
1 1001     Physics 85
2 2001   Chemistry 87
3 3001 Mathematics 93
4 4001     English 84
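For a quick experiment, read.table() can also parse the table from a string through its `text` argument, so no file is needed. In the sketch below, the column names “Code”, “Subject” and “Marks” are assumptions introduced for illustration; the source file has no header and would otherwise get V1, V2 and V3.

```r
# The table from the text, held in a string instead of d:/mydata.txt
raw <- "1001 Physics 85
2001 Chemistry 87
3001 Mathematics 93
4001 English 84"

# Whitespace is the default separator; col.names replaces V1, V2, V3
mydata <- read.table(text = raw, col.names = c("Code", "Subject", "Marks"))

dim(mydata)      # 4 rows, 3 columns
mydata$Marks     # 85 87 93 84
```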
Merging Data Frames
Let us now attempt to merge two data frames using the merge function.
The merge function takes an x frame (read from item.csv) and a y frame
(read from item-tab-sep.txt) as arguments. By default, it joins the two
frames on columns with the same name (the two “Itemcode” columns).
> csvitem <- read.csv("d:/item.csv")
> tabitem <- read.table("d:/item-tab-sep.txt", sep="\t", header=TRUE)
> merge(x=csvitem, y=tabitem)
  Itemcode     ItemCategory ItemPrice ItemQtyOnHand ItemReorderLvl
1    I1001      Electronics       700            75             25
2    I1002 Desktop supplies       300            30             25
3    I1003  Office supplies       350            35             25
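In the example above every item code appears in both frames. The sketch below shows what merge() does when that is not the case. The frames are built in memory, and the item “I1004” with quantity 40 is a hypothetical row added only to create a mismatch; `inner` and `outer` are illustrative names.

```r
csvitem <- data.frame(
  Itemcode     = c("I1001", "I1002", "I1003"),
  ItemCategory = c("Electronics", "Desktop supplies", "Office supplies"),
  ItemPrice    = c(700, 300, 350)
)

# Hypothetical stock data: I1003 is missing and I1004 is extra
tabitem <- data.frame(
  Itemcode       = c("I1001", "I1002", "I1004"),
  ItemQtyOnHand  = c(75, 30, 40),
  ItemReorderLvl = c(25, 25, 25)
)

# Naming the key column makes the join condition explicit
inner <- merge(x = csvitem, y = tabitem, by = "Itemcode")

# all = TRUE keeps unmatched rows from both frames and fills the gaps with NA
outer <- merge(x = csvitem, y = tabitem, by = "Itemcode", all = TRUE)

nrow(inner)    # 2: only I1001 and I1002 occur in both frames
nrow(outer)    # 4: I1003 and I1004 are kept with NA in the missing columns
```

By default merge() keeps only the rows whose key occurs in both frames, so unmatched items are dropped silently unless all, all.x or all.y is set.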
