Data Analysis Using R - 4

Uploaded by

harshvasudevkoli
Copyright
© © All Rights Reserved
DAR

4. Working with Data

4.1 Functions in R

• In R, a function is a block of code that performs a specific task or set of tasks. Functions are essential for
organizing and reusing code, making R programs more modular and easier to maintain.
• R provides a large number of built-in functions and also lets users define their own
functions.
• Types of functions in R:
  - Built-in functions: predefined functions that ship with R and perform common tasks or
    operations, such as sum(), mean() and paste().
  - User-defined functions: functions that we write ourselves.

4.1.1 User-defined Function:

• Creating a function: An R function is created with the keyword function. The basic syntax
of a function definition is as follows.
function_name <- function(arg_1, arg_2, ...) {
  # function body
}
Example:
my_function <- function() {   # create a function named my_function
  print("Hello World!")
}
• Calling a function: To call a function, write its name followed by parentheses, for example
my_function().
Example:
my_function()   # call the function named my_function
• Function arguments: Arguments (parameters) pass values into a function for processing.
They are listed inside the parentheses of the function definition. A function can take as many
arguments as needed, separated by commas.
my_function <- function(fname)
{
paste(fname, "Griffin")
}
my_function("Peter")
my_function("Lois")
my_function("Stewie")
#Output:
[1] "Peter Griffin"
[1] "Lois Griffin"
[1] "Stewie Griffin"
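As a small illustration of passing more than one argument, the sketch below uses a hypothetical function full_name (not part of the notes above). Its second argument has a default value, so it can be left out of the call.

```r
# Hypothetical helper for illustration: joins a first name and a last name.
# lname has a default value, so it may be omitted in the call.
full_name <- function(fname, lname = "Griffin") {
  paste(fname, lname)
}

first  <- full_name("Peter")             # uses the default last name
second <- full_name("Homer", "Simpson")  # overrides the default
print(first)
print(second)
```

Arguments are matched by position here; they could also be passed by name, e.g. full_name(lname = "Simpson", fname = "Homer").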

4.1.2 Lazy Evaluation of Functions:

• R evaluates function arguments lazily: an argument is evaluated only when the function body
actually uses it.
• As a result, a call with a missing argument still succeeds as long as execution never reaches
that argument.
• This can improve performance by avoiding unnecessary computation.
Example:
calculate <- function(a, b) {
  square <- a^2
  return(square)
}
# This call works because b is never used in the
# calculations inside the function.
print(calculate(5))
# Output: [1] 25

calculate <- function(a, b) {
  add <- a + b
  return(add)
}
# This call throws an error because b is used but was not supplied.
print(calculate(5))
# Output: Error in calculate(5) : argument "b" is missing, with no default
# Calls: print -> calculate
# Execution halted
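Lazy evaluation also applies to default values. The sketch below is an illustration only, with hypothetical functions double_a and describe: the default of b would raise an error if it were ever evaluated, yet the call succeeds because b is never used. The base function missing() tests whether an argument was supplied without evaluating it.

```r
# Hypothetical function: the default for b raises an error if it is ever evaluated.
double_a <- function(a, b = stop("b was evaluated")) {
  a * 2   # b is never touched, so its default expression never runs
}
result <- double_a(3)
print(result)

# missing() reports whether an argument was supplied, without evaluating it.
describe <- function(a, b) {
  if (missing(b)) "b not supplied" else paste("b is", b)
}
print(describe(1))
print(describe(1, 2))
```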

4.2 Import and Export data to/from Text and CSV file

• Before importing data from an Excel file into R, the data must be prepared, i.e. put into a proper
format.
• Some formatting guidelines:
1. The first row of a spreadsheet is usually the header. Reserve the first row for column names.
2. Avoid blank spaces in field names and values. For example, a field named "enrolment number"
would be read as 2 separate words. Use enrolment.number instead, joining the two words
with a dot.
3. Avoid names containing special characters such as @, &, #, %, +, /, (, ), {, }, [, < etc.
4. Delete any comments from the Excel file; otherwise they are read as a separate column.
5. Indicate missing values in the Excel file as NA.
6. Excel files are commonly saved as .xls or .xlsx, but a sheet can also be saved
as .txt or .csv.
7. Depending on the type/extension of the file, data fields are separated either by tabs or
by commas.
8. After these preparations, the file is ready to import into R.
In R, data can be imported in two ways: with base commands or with add-on packages.
• The basic import commands are stored in the utils package, a built-in R package of utility
functions.
• The following commands import such files into R:
• Import data from a text file into R:
1. read.table():
• If the spreadsheet is saved as .txt, use read.table() to read the text file.
demo <- read.table("filename.txt", header = TRUE, sep = "/", strip.white = TRUE)
• demo is the name of the R object (a data frame) that receives the imported data.
• filename.txt is the file to import. Give the complete path unless the file is in the current
working directory.
• header = TRUE is used when the first row of the file contains column names. Argument names
are case-sensitive, so it must be written header, not Header.
• By default read.table() treats any run of white space (spaces or tabs) as the separator. If the
file uses another symbol, name it with the sep parameter.
Here sep = "/" means that the input file uses / as the separator.
• strip.white = TRUE strips leading and trailing white space from unquoted
character fields. It has an effect only when sep is specified.
• If the input file has no header, R assigns default column names (V1, V2, ...).
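The call above can be tried without an existing file. The sketch below is an illustration with made-up data: it first writes a small "/"-separated file to a temporary location (tempfile() and writeLines() are base R helpers not used in the notes) and then imports it with the same arguments.

```r
# Create a small "/"-separated text file in a temporary location (made-up data).
txt_file <- tempfile(fileext = ".txt")
writeLines(c("rno/name/marks",
             "1/ Asha /78",
             "2/ Ravi /NA"),
           txt_file)

# Import it: the first row is the header, "/" separates the fields,
# and strip.white removes the blanks around the names.
demo <- read.table(txt_file, header = TRUE, sep = "/", strip.white = TRUE)
print(demo)
str(demo)   # the NA entry in marks is read as a missing value
```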
--------------------------------------------------------------------------------------------------------------------------------

Export data from R to text file

2. write.table():

It is used to export data from R to an external file.

write.table(student, "studinfo.txt", row.names = FALSE)
• student is the name of the R object to be exported.
• studinfo.txt is the name of the file that receives the data. By default the file is created
in the current working directory. Give a complete file path to write it to
another location.
• Set row.names = FALSE to leave out the row names. By default it
is set to TRUE.
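A minimal round trip, assuming a made-up student data frame and a temporary folder instead of the working directory, shows what write.table() produces and that the file can be read back:

```r
# Made-up data frame standing in for the student object.
student <- data.frame(rno = 1:3,
                      name = c("Asha", "Ravi", "Meena"),
                      marks = c(78, 65, 91))

out_file <- file.path(tempdir(), "studinfo.txt")
write.table(student, out_file, row.names = FALSE)   # header row is written, row names are not

# Read the file back; write.table() separates fields with a single space by default.
check <- read.table(out_file, header = TRUE)
print(check)
```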
• Import data from a csv file into R

1. read.csv() / read.csv2(): read comma-separated files.

• If the spreadsheet is saved as .csv, use read.csv() to read the .csv file.
demo <- read.csv(file = "filename.csv", header = TRUE, stringsAsFactors = FALSE,
                 strip.white = TRUE)
o demo is the name of the R object (a data frame) that receives the imported data.
o filename.csv is the file to import. Give the complete path unless the file is in the current
working directory.
o header = TRUE is used when the first row of the file contains column names.
o strip.white = TRUE strips leading and trailing white space from unquoted character
fields.
o stringsAsFactors specifies whether character columns are converted to factors. Since
R 4.0.0 the default is FALSE.
o A csv file normally uses commas as the separator. If the file uses ";" as the separator (and ","
as the decimal mark), use read.csv2() with the same arguments.
--------------------------------------------------------------------------------------------------------------------------------

Export data from R to csv file

2. write.csv():
• It is used to export data from R to an external csv file.
write.csv(student, "studinfo.csv", row.names = FALSE)
• student is the name of the R object to be exported.
• studinfo.csv is the name of the file that receives the data. By default the file is created
in the current working directory. Give a complete file path to write it to
another location.
• Set row.names = FALSE to leave out the row names. By default it is
set to TRUE.
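The difference between the two csv dialects can be seen in a short sketch. The data are made up, and write.csv2() (the base R counterpart of read.csv2(), not mentioned in the notes above) is used to produce the ";"-separated file.

```r
student <- data.frame(rno = 1:2,
                      name = c("Asha", "Ravi"),
                      marks = c(78.5, 65.25))

csv_file  <- tempfile(fileext = ".csv")
csv2_file <- tempfile(fileext = ".csv")

write.csv(student, csv_file, row.names = FALSE)     # "," separator, "." decimal mark
write.csv2(student, csv2_file, row.names = FALSE)   # ";" separator, "," decimal mark

from_csv  <- read.csv(csv_file, stringsAsFactors = FALSE)
from_csv2 <- read.csv2(csv2_file, stringsAsFactors = FALSE)

print(readLines(csv2_file))   # shows the ";" separators and "," decimals
print(from_csv2)
```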
Deleting data from file : (Deleting the Pages Column)
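The notes do not show the code for this step. A minimal sketch, assuming a data frame named bookinfo with a Pages column (both the name and the values are hypothetical), removes the column and writes the remaining data to a csv file:

```r
# Hypothetical data frame with a Pages column (values made up).
bookinfo <- data.frame(bid = 1:3,
                       bname = c("C", "C++", "java"),
                       Pages = c(250, 410, 380))

bookinfo$Pages <- NULL   # assigning NULL deletes the column
print(names(bookinfo))

# Write the reduced data frame to a csv file in a temporary folder.
out_file <- file.path(tempdir(), "bookinfo.csv")
write.csv(bookinfo, out_file, row.names = FALSE)
```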

4.3 Import and Export data to/from Excel file:

• Import data from an Excel file into R
• Several R packages can import Excel files. A package must be installed once and then loaded
with library() before its functions can be used.
i. XLConnect package:
• install.packages("XLConnect")   # install the package
• library(XLConnect)              # load the package
1. readWorksheetFromFile():
demo <- readWorksheetFromFile("filename.extension", sheet = 1, startRow = 2, endRow = 10, endCol = 3)
• readWorksheetFromFile() reads a specified sheet from an Excel file.
• sheet gives the number (index) or name of the sheet to read.
• startRow and startCol give the row and column at which reading starts.
• endRow and endCol give the row and column up to which data is read.
• If these indexes are not specified, reading starts at the first row and column that contain data.
• Alternatively, the entire workbook can be loaded first and a sheet read from it with the
following commands:
wb <- loadWorkbook("studinfo.xlsx")                  # load the complete workbook into R
getSheets(wb)                                        # list the sheets in the workbook
demo <- readWorksheet(wb, sheet = 2)                 # then read the required sheet
wb <- loadWorkbook("studinfo.xlsx", create = TRUE)   # create the workbook if it does not exist
createSheet(wb, name = "TYFS")                       # create a new sheet named "TYFS"
ii. xlsx package:
• install.packages("xlsx")
• library(xlsx)
2. read.xlsx():
demo <- read.xlsx("filename.extension", sheetIndex = 1, rowIndex = 5, colIndex = 3)
• sheetIndex gives the index of the sheet to read.
• rowIndex and colIndex are numeric vectors giving the rows and columns to read; pass a range
such as rowIndex = 1:5 to read several rows.
----------------------------------------------------------------------------------------------------------------------------------
iii. readxl package:
• install.packages("readxl")
• library(readxl)
3. read_excel():
demo <- read_excel("filename.extension", sheet = 4, skip = 2)
• read_excel() reads the given sheet of the specified file.
• The command above reads sheet no. 4, skipping the first 2 rows.

• Skip the first 2 columns of an Excel file.

• Changing column names.
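The notes list these two steps without code. One possible approach, sketched here with base R on the data frame after it has been imported, uses a made-up data frame demo that stands in for the result of read_excel():

```r
# Made-up data frame standing in for the result of read_excel().
demo <- data.frame(sr = 1:3,
                   code = c("a", "b", "c"),
                   title = c("C", "C++", "java"),
                   copies = c(25, 12, 20))

# Skip the first 2 columns by dropping them after the import.
demo <- demo[, -(1:2)]

# Change the column names.
names(demo) <- c("book", "sold")
print(demo)
```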

--------------------------------------------------------------------------------------------------------------------------------
• Export data from R into an Excel file.
1. writeWorksheet():
writeWorksheet(wb, TYFSdf, sheet = "TYFS", startRow = 1, startCol = 1)
• writeWorksheet() writes an R data frame into a worksheet of a workbook, here the newly
created one.
• In the above command:
• wb is the workbook object created with loadWorkbook().
• TYFS is the name of the worksheet created in workbook wb.
• TYFSdf is the name of the data frame to be written into the worksheet.
• startRow and startCol give the row and column index at which
writing starts.
saveWorkbook(wb)   # save the workbook and write the file to disk in the current working directory

--------------------------------------------------------------------------------------------------------------------------------

2. write.xlsx():

write.xlsx(demo, "emp.xlsx")   # write the demo data frame to the file emp.xlsx

write.xlsx(demo, "emp.xlsx", sheetName = "new_emp", append = TRUE)

# write the demo data frame to the existing emp.xlsx file as a new worksheet named new_emp

The append parameter adds the data frame to an existing file as a new sheet.

--------------------------------------------------------------------------------------------------------------------------------

iv. writexl package:
• install.packages("writexl")
• library(writexl)
3. write_xlsx():
write_xlsx(demo, "bookinfo.xlsx")
• write_xlsx() writes a data frame to an Excel file.
• write_xlsx() never writes row names, so it has no row.names parameter. Set col_names = FALSE
to leave out the header row; by default it is set to TRUE.
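A round trip through an Excel file can be sketched as follows. This is an illustration only: it assumes the writexl and readxl packages are installed, uses made-up data and a temporary file, and skips everything if either package is missing. The flag excel_round_trip_ok is a hypothetical name used only for this check.

```r
# Runs only if both packages are installed; otherwise the example is skipped.
if (requireNamespace("writexl", quietly = TRUE) &&
    requireNamespace("readxl", quietly = TRUE)) {
  demo <- data.frame(bid = 1:3,
                     bname = c("C", "C++", "java"),
                     copies = c(25, 12, 20))

  xlsx_file <- tempfile(fileext = ".xlsx")
  writexl::write_xlsx(demo, xlsx_file)               # export the data frame
  back <- readxl::read_excel(xlsx_file, sheet = 1)   # import the first sheet again

  print(back)
  excel_round_trip_ok <- nrow(back) == nrow(demo)
} else {
  excel_round_trip_ok <- NA   # packages not available, nothing was checked
}
```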

4.4 Database connectivity via ODBC:

• Import data and perform different operations on it:

1. RMySQL package: RMySQL is a database interface and MySQL driver for R.

• install.packages("RMySQL")
• library(RMySQL)
• Make a connection object:
con <- dbConnect(MySQL(), user = "root", password = "Pass@123", host = "localhost",
                 dbname = "employee")
• The MySQL() function creates a driver object for MySQL.
• user, password and host are the values that were set during the installation of MySQL.
• dbname is the name of the database to connect to.

• The following commands work with the MySQL connection:
1. Get a connection summary:
summary(con)

2. Get database information:
dbGetInfo(con)

3. Show the tables in the connected database:
dbListTables(con)

4. Show the fields of a table:
dbListFields(con, "marketing")    # display the fields of the marketing table

5. Remove a table from the database:
dbRemoveTable(con, "marketing")   # remove the marketing table from the connected database

6. Read an entire table from the database:
dbReadTable(con, "testing")       # read the table "testing"

7. Extract rows from a table:

dbSendQuery() submits an SQL query to the database engine and executes it.
o It does not extract any records itself. dbFetch() (or the older fetch()) retrieves the records.
o dbGetQuery() sends the query, fetches all rows and clears the result in one step, which suits
interactive sessions.

market <- dbSendQuery(con, "select * from marketing;")
# or, with a condition:
market <- dbSendQuery(con, "select * from marketing where sal > 5000;")
market_data <- dbFetch(market)           # fetch all rows of the result
dbGetRowCount(market)                    # number of rows fetched so far (takes the result object)
market_data <- dbFetch(market, n = 10)   # alternatively, fetch only the next 10 rows
market_data <- dbFetch(market, n = -1)   # n = -1 fetches all remaining rows

8. Get the number of rows affected by a query:

dbGetRowsAffected(market)

• Export data to the database

1. Execute various queries on the database:

dbSendQuery(con, "insert into testing values(15, 'jack', 6000);")
dbSendQuery(con, "update testing set salary = 7000 where empno = 10;")
dbSendQuery(con, "drop table if exists marketing;")

2. Clear the result set and free its resources:

dbClearResult(market)

3. Overwrite a table in the database:

dbWriteTable(con, "testing", new_test, overwrite = TRUE)   # overwrite the table testing with the data frame new_test

4. Append data to a table in the database:

dbWriteTable(con, "testing", new_test, append = TRUE)      # append the data frame new_test to the table testing

5. Disconnect from the database:

dbDisconnect(con)
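The commands above need a running MySQL server. To try the same workflow without one, the sketch below swaps in an in-memory SQLite database through the RSQLite package. This is a substitution made for illustration, not the setup described in the notes; it assumes the DBI and RSQLite packages are installed, uses made-up data, and is skipped otherwise. The DBI functions are the same ones used with RMySQL.

```r
# Assumption: DBI and RSQLite are installed. RSQLite stands in for RMySQL here
# so that no database server is needed.
if (requireNamespace("DBI", quietly = TRUE) &&
    requireNamespace("RSQLite", quietly = TRUE)) {
  library(DBI)
  con <- dbConnect(RSQLite::SQLite(), ":memory:")   # temporary in-memory database

  testing <- data.frame(empno = c(10, 11, 12),
                        name = c("amit", "jack", "sara"),
                        salary = c(4000, 6000, 7500))
  dbWriteTable(con, "testing", testing)             # export the data frame as a table
  print(dbListTables(con))
  print(dbListFields(con, "testing"))

  res <- dbSendQuery(con, "select * from testing where salary > 5000;")
  high_paid <- dbFetch(res, n = -1)                 # fetch all matching rows
  dbClearResult(res)                                # free the result set
  print(high_paid)

  dbDisconnect(con)
} else {
  high_paid <- NULL   # packages not available, example skipped
}
```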

4.5 Import XML file:

• install.packages("XML")
• library(XML)
• library(methods)

• Import an XML file:

emp <- xmlParse(file = "employee.xml")
print(emp)   # prints the parsed XML document

• Extract the root node of the parsed file:

root <- xmlRoot(emp)

• Find the number of child nodes under the root:

filesize <- xmlSize(root)

• Print specific nodes of the file:

print(root[1])          # display the data of the 1st node
print(root[[1]][[3]])   # display the 3rd component/element of the 1st node

• Convert an XML file to a data frame

empdf <- xmlToDataFrame("employee.xml")
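The XML steps can be tried without an employee.xml file by parsing a string instead. The sketch below is an illustration with made-up records; it assumes the XML package is installed and is skipped otherwise.

```r
if (requireNamespace("XML", quietly = TRUE)) {
  library(XML)
  # Made-up XML content, built without white space between the elements.
  xml_text <- paste0(
    "<employees>",
    "<employee><id>1</id><name>Asha</name><salary>5000</salary></employee>",
    "<employee><id>2</id><name>Ravi</name><salary>6200</salary></employee>",
    "</employees>")

  emp     <- xmlParse(xml_text, asText = TRUE)   # parse from a string instead of a file
  root    <- xmlRoot(emp)
  n_nodes <- xmlSize(root)                       # number of <employee> nodes
  print(root[[1]][[2]])                          # <name> element of the 1st employee
  empdf   <- xmlToDataFrame(emp)                 # one row per <employee>
  print(empdf)
} else {
  n_nodes <- NA      # package not available, example skipped
  empdf   <- NULL
}
```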

• Export a data frame to an XML file:

• install.packages("kulife")   # provides write.xml(); skip if the package is already installed
• library(kulife)
write.xml(new_emp, "newemp.xml")   # export the data frame new_emp to the file newemp.xml
4.6 Graphical data analysis :

4.6.1 Simple Graph: plot()

• plot() in R draws points (and lines) on a graph.
• It is a generic function with many methods; the method called depends on the type of the object
passed to it.
• Its basic use is to draw a scatter plot or line graph of 2 vectors, i.e. to plot one
vector against the other.
• It produces a 2-dimensional graph.
• syntax: plot(x, y, type, main, xlab, ylab, col, cex, pch, lwd, lty)
• x and y are the two input vectors for the X and Y axes respectively.
• type is a code that specifies the type of plot:
- "p" plots points only
- "l" plots lines only
- "b" plots both points and lines
- "c" plots only the lines part of "b", leaving gaps where the points would be
- "o" plots points and lines over-plotted on each other
- "h" plots histogram-like vertical lines
- "s" plots stair steps
• main gives the title of the plot.
• xlab and ylab specify the labels of the X and Y axes respectively.
• col specifies the colour of the points and lines.
• cex specifies the size of the points. The default size is 1.
• pch specifies the shape of the points. Its value ranges from 0 to 25.
• lwd specifies the line width. The default width is 1.
• lty specifies the line style. Its value ranges from 0 to 6:
- 0 draws no line
- 1 displays a solid line
- 2 displays a dashed line
- 3 displays a dotted line
- 4 displays a "dot dashed" line
- 5 displays a "long dashed" line
- 6 displays a "two dashed" (long and short dashes) line
• Example:
months <- 1:12
temp21 <- c(19.5, 22.3, 24.4, 27.2, 31.9, 31.0, 30.5, 28.0, 27.4, 25.2, 23.1, 20.0)
plot(months, temp21, type = "b", main = "temperature-2021", col = "blue", cex = 2, pch = 2, lwd = 2,
     lty = 1)
• To change the labels of the x and y axes, use xlab and ylab:
plot(months, temp21, type = "b", ylab = "temperature", main = "temperature-2021", col = "blue",
     cex = 1, pch = 2, lwd = 2)

• Comparison of 2 plots using the points() and lines() functions:

temp20 <- c(18, 20.3, 22.4, 24.2, 27.9, 29, 30, 27, 25.4, 23.2, 21.1, 20)
plot(months, temp20, type = "b", main = "temp 2020 vs 2021", col = "blue", cex = 2, pch = 2, lwd = 2, lty = 1)
points(months, temp21, col = "red", cex = 2, type = "b")
OR
lines(months, temp21, col = "red", cex = 2, type = "b")
legend(x = "topright", title = "years", legend = c("2020", "2021"), fill = c("blue", "red"))

Comparing 2 graphs:
> plot(months, temp21, type = "b", ylab = "temperature", main = "Temperature 2020 vs 2021",
       col = "blue", cex = 1, pch = 2, lwd = 2)
> points(months, temp20, col = "red", cex = 2, type = "b")
> legend(x = "topright", title = "Years", legend = c("2020", "2021"), fill = c("red", "blue"))
Plot using two different inputs (list and data frame):

> plot(classA$rno, classA$marks, type = "b", xlab = "Roll No.", ylab = "Percentage",
       main = "Class A vs Class B", col = "orange", cex = 1, pch = 2, lwd = 2)
> points(classB$rno, classB$marks, type = "b", col = "green")
> legend(x = "topright", title = "Class", legend = c("Class A", "Class B"), fill = c("orange",
         "green"))

• Save a plot to a file using commands:

1. Save as a jpeg image:
> jpeg(file = "temperature.jpeg")
> plot(months, temp21, type = "b", main = "temperature-2021", col = "blue", cex = 2, pch = 2, lwd = 2, lty = 1)
> dev.off()
Note: dev.off() shuts down the current graphics device. Here it closes the
jpeg file so that the plot is written to disk.

2. Save as a png image:

> png(file = "temperature.png")
> plot(months, temp21, type = "b", main = "temperature-2021", col = "blue",
       cex = 2, pch = 2, lwd = 2, lty = 1)
> dev.off()
OR, to set the image size in pixels:
> png(file = "temperature.png", width = 600, height = 350)

3. Save as a bmp image:

> bmp(file = "temperature.bmp")
> plot(months, temp21, type = "b", main = "temperature-2021", col = "blue", cex = 2, pch = 2, lwd = 2, lty = 1)
> dev.off()
OR, to set the size in inches at a resolution of 100 dpi:
> bmp(file = "temperature.bmp", width = 6, height = 4.5, units = "in", res = 100)

4. Save as pdf:
> pdf(file = "temperature.pdf")
> plot(months, temp21, type = "b", main = "temperature-2021", col = "blue", cex = 2, pch = 2, lwd = 2, lty = 1)
> dev.off()
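The open device, plot, dev.off() pattern can be checked end to end with the sketch below. It writes the pdf to a temporary folder rather than the working directory (an assumption made so that the example leaves no files behind) and then confirms that the file exists.

```r
months <- 1:12
temp21 <- c(19.5, 22.3, 24.4, 27.2, 31.9, 31.0, 30.5, 28.0, 27.4, 25.2, 23.1, 20.0)

plot_file <- file.path(tempdir(), "temperature.pdf")
pdf(file = plot_file, width = 6, height = 4.5)   # open the device; size in inches
plot(months, temp21, type = "b", main = "temperature-2021", col = "blue")
dev.off()                                        # close the device and write the file

print(file.exists(plot_file))
print(file.size(plot_file) > 0)
```

Nothing is written to disk until dev.off() is called, so forgetting it usually leaves an empty or unreadable file.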

4.6.2 Pie Chart

• A pie chart is a circular graph that shows numerical proportions as slices.
• It is used to show how much each slice contributes to the whole.
• syntax: pie(x, labels, radius, main, col, clockwise)
• x is the input numeric vector.
• labels gives the descriptions of the slices.
• radius is the radius of the circle.
• main is the title of the chart.
• col gives the colours of the slices.
• clockwise is a logical value indicating whether the slices are drawn clockwise or anti-clockwise.
Example:
books <- c("Biography", "comic", "poetry", "story", "fashion magazines", "Cookbook", "Fiction")
readers <- c(20, 30, 50, 25, 32, 40, 35)
pie(readers, labels = books, main = "readers survey", col = rainbow(length(books)))

• How to calculate the percentage of readers for the above pie chart:

per <- round(readers / sum(readers) * 100)
lblread <- paste(books, "-", per, "%")
pie(readers, labels = lblread, radius = 0.5, main = "readers survey", col = rainbow(length(books)))
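Working the percentage formula through for the readers vector shows what the labels contain. The sketch also makes one side effect visible: because each percentage is rounded separately, the rounded values need not add up to exactly 100.

```r
readers <- c(20, 30, 50, 25, 32, 40, 35)

total <- sum(readers)
per   <- round(readers / total * 100)   # percentage share of each slice, rounded

print(total)      # 232
print(per)        # 9 13 22 11 14 17 15
print(sum(per))   # 101, not 100, because of rounding
```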

• How to draw a legend for a pie chart:

per2 <- paste(round(readers / sum(readers) * 100), "%")
pie(readers, labels = per2, radius = 0.7, main = "readers survey", col = rainbow(length(books)))
legend(x = "topright", cex = 0.7, title = "book type", legend = books, fill = rainbow(length(books)))
• 3D Pie Chart in R
• A 3D pie chart is created by using pie3D().
• The plotrix package is required.
• install.packages("plotrix")
• library(plotrix)
pie3D(readers, labels = books, explode = 0.05, main = "readers survey", col = rainbow(length(books)))

4.6.3 Bar Chart


• A bar chart is a graph with rectangular bars.
• It represents categorical data.
• The height of each bar is proportional to the value it represents.
• syntax: barplot(x, xlab, ylab, main, names.arg, col)
• x is the input numeric vector or matrix that the bars represent.
• xlab and ylab are the labels for the X and Y axes respectively.
• main is the title of the bar chart.
• names.arg is a vector of names that appear under each bar.
• col gives the colours of the bars.
Dataframe 1
> event   # data frame
     ename scount
1   coding     15
2    paper     25
3  project     20
4 roborace     40
• Simple bar chart:
barplot(event$scount, names.arg = event$ename)
• Bar chart with some more parameters:
barplot(event$scount, names.arg = event$ename, main = "Event Analysis", xlab = "Event names",
        ylab = "No. of students", col = rainbow(4))

• Change the width of each bar in the bar chart:

barplot(event$scount, names.arg = event$ename, main = "Event Analysis", xlab = "Event names",
        ylab = "No. of students", col = rainbow(4), width = c(0.2, 0.5, 0.4, 1))

• Bar chart with different colour palettes:

barplot(event$scount, names.arg = event$ename, main = "Event Analysis", xlab = "Event names",
        ylab = "No. of students", col = rainbow(4))   # same as the graph above
barplot(event$scount, names.arg = event$ename, main = "Event Analysis", xlab = "Event names",
        ylab = "No. of students", col = heat.colors(4))
barplot(event$scount, names.arg = event$ename, main = "Event Analysis", xlab = "Event names",
        ylab = "No. of students", col = terrain.colors(4))
barplot(event$scount, names.arg = event$ename, main = "Event Analysis", xlab = "Event names",
        ylab = "No. of students", col = topo.colors(4))

• Print the value of each bar using the text() function and draw a box around the graph:
value <- barplot(event$scount, names.arg = event$ename, main = "Event Analysis", xlab = "Event names",
                 ylab = "No. of students", col = topo.colors(4))
text(value, 0, event$scount, cex = 1, pos = 3)
box()   # draws a box around the graph
• Horizontal bar chart:
barplot(event$scount, names.arg = event$ename, main = "Event Analysis", ylab = "Event names",
        xlab = "No. of students", col = topo.colors(4), horiz = TRUE)
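The examples above assume that the event data frame already exists. The sketch below recreates it from the table shown earlier and inspects what barplot() returns: the x-coordinates of the bar centres, which is why the returned value can be passed to text(). As a variation on the notes, the labels are placed above the bars rather than at y = 0, and pdf(NULL) is used so that nothing is drawn on screen or written to disk.

```r
# Recreate the data frame shown above.
event <- data.frame(ename = c("coding", "paper", "project", "roborace"),
                    scount = c(15, 25, 20, 40))

pdf(NULL)   # null graphics device: no window and no file
value <- barplot(event$scount, names.arg = as.character(event$ename), ylim = c(0, 45))
text(value, event$scount, labels = event$scount, pos = 3)   # label above each bar
dev.off()

print(value)   # x-coordinates of the bar centres returned by barplot()
```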

Dataframe 2
> event1
     ename boys girls
1   coding   15    20
2    paper   25    23
3  project   20    34
4 roborace   40    39
• A simple bar chart cannot show the above data frame because it has two categories of students (boys, girls).
We need to create a grouped or stacked bar chart for it.
• First we create a matrix from those two columns.
emat <- matrix(c(event1$boys, event1$girls), nrow = 2, ncol = 4, byrow = TRUE)
rownames(emat) <- c("boys", "girls")
colnames(emat) <- c("coding", "paper", "project", "roborace")
emat
Output:
      coding paper project roborace
boys      15    25      20       40
girls     20    23      34       39
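The same matrix can also be built directly from the data frame. The sketch below is an alternative to the matrix() call above, not the method used in the notes: it recreates event1 from the table and transposes the two numeric columns with t(), so the row names come from the column names automatically.

```r
event1 <- data.frame(ename = c("coding", "paper", "project", "roborace"),
                     boys = c(15, 25, 20, 40),
                     girls = c(20, 23, 34, 39))

# t() transposes the two numeric columns so that the rows are boys/girls
# and the columns are the events, the layout barplot() expects.
emat <- t(as.matrix(event1[, c("boys", "girls")]))
colnames(emat) <- as.character(event1$ename)
print(emat)
```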

• Grouped bar chart:
barplot(emat, xlab = "events", ylab = "no.of.students", main = "Event Analysis", names.arg = c("coding",
        "paper", "project", "roborace"), col = c(7, 2), beside = TRUE)
• With beside = TRUE, a grouped bar chart is created.
• Print the value of each bar using the text() function:
• First store the x-coordinates of the bars, which barplot() returns, in a variable.
img <- barplot(emat, xlab = "events", ylab = "no.of.students", main = "Event Analysis", names.arg = c(
               "coding", "paper", "project", "roborace"), col = c(7, 2), beside = TRUE)
text(img, 0, emat, cex = 1, pos = 3)

• Stacked bar chart:

img <- barplot(emat, xlab = "events", ylab = "no.of.students", main = "Event Analysis", names.arg = c(
               "coding", "paper", "project", "roborace"), col = c(7, 2))
legend(x = "top", cex = 1, legend = c("girls", "boys"), fill = c(2, 7))
• Omitting the beside parameter (its default is FALSE) produces a stacked bar chart.
• Another example:
Creating a data frame bookdata:
> bookdata
   bid   bname copies month
1    1       C     25   jan
2    2     C++     12   jan
3    3    dbms     15   jan
4    4    java     20   jan
5    5 mongoDB      8   jan
6    1       C     11   feb
7    2     C++      3   feb
8    3    dbms     23   feb
9    4    java      8   feb
10   5 mongoDB     10   feb

Converting the data frame into a matrix:

> bmat <- matrix(bookdata$copies, nrow = 5, ncol = 2)
> colnames(bmat) <- c("jan", "feb")
> rownames(bmat) <- c("C", "C++", "dbms", "java", "mongoDB")
> bmat
        jan feb
C        25  11
C++      12   3
dbms     15  23
java     20   8
mongoDB   8  10

Plotting the data on the graph:

• Stacked bar chart:
img <- barplot(bmat, xlab = "months", ylab = "no.of.copies", main = "monthwise book sale",
               names.arg = c("jan", "feb"), col = rainbow(5))
legend(x = "topright", cex = 1, legend = c("C", "C++", "DBMS", "JAVA", "MongoDB"), fill = rainbow(5))

4.6.4 Histogram:
• A histogram represents the frequencies of the values of a variable bucketed into ranges.
• A histogram is similar to a bar chart, but it groups the values into continuous ranges.
• The height of each bar represents the number of values that fall in that range.
• R creates a histogram with the hist() function. It takes a vector as input and accepts some
more parameters that control the plot.
• Syntax:
hist(v, main, xlab, xlim, ylim, breaks, col, border)
The parameters are described below:

• v is a vector containing the numeric values used in the histogram.
• main is the title of the chart.
• col sets the colour of the bars.
• border sets the border colour of each bar.
• xlab gives the description of the x-axis.
• xlim specifies the range of values on the x-axis.
• ylim specifies the range of values on the y-axis.
• breaks specifies the break points between the bars (or the number of bars), and so determines the
width of each bar.
v <- c(1, 5, 7, 3, 7, 9, 2, 4, 7, 2, 4, 7, 6, 8, 1)
hist(v, xlab = "No.of books", main = "Histogram", xlim = c(0, 10), ylim = c(0, 10), col = "yellow",
     border = "black")
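To see how breaks groups the values, the bins can be computed without drawing anything. The sketch below reuses the vector v and assumes explicit break points at 0, 2, ..., 10 (a choice made for illustration); plot = FALSE makes hist() return the bins instead of plotting them.

```r
v <- c(1, 5, 7, 3, 7, 9, 2, 4, 7, 2, 4, 7, 6, 8, 1)

# plot = FALSE returns the computed bins without drawing the histogram.
h <- hist(v, breaks = seq(0, 10, by = 2), plot = FALSE)

print(h$breaks)   # 0 2 4 6 8 10
print(h$counts)   # 4 3 2 5 1
```

Each count is the height of one bar: for example, the four values 1, 1, 2 and 2 fall in the first range, from 0 to 2.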

4.6.5 Boxplot:
• A boxplot is a chart that shows how the values of one or more groups of data are distributed, drawing
one box for each group.
• The distribution is summarised by five numbers (minimum, first quartile, median, third quartile and
maximum).
• Boxplots are created in R by using the boxplot() function.
• Syntax: boxplot(x, data, notch, varwidth, names, main)
- x: a vector or a formula.
- data: the data frame that holds the variables used in the formula.
- notch: a logical value. Set it to TRUE to draw a notch on each side of the box around the median.
- varwidth: a logical value. Set it to TRUE to draw the width of each box in proportion to the square
root of the sample size.
- main: the title of the chart.
- names: the group labels that are shown under each boxplot.
• With notch = TRUE, each box is drawn with a notch around its median.
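The five numbers behind a boxplot can be inspected directly. The sketch below uses a made-up sample x; fivenum() is a base R function not mentioned in the notes, and plot = FALSE makes boxplot() return its statistics without drawing the chart.

```r
x <- c(2, 4, 4, 5, 7, 9, 10)   # made-up sample

# minimum, lower hinge, median, upper hinge, maximum
print(fivenum(x))              # 2 4 5 8 10

# boxplot() computes the same five numbers for each box.
b <- boxplot(x, plot = FALSE)
print(b$stats)                 # one column per box
```

The two results agree here because the sample has no outliers; when outliers are present, the first and last values of b$stats are the ends of the whiskers rather than the minimum and maximum.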
