Unit-14 Data Interfacing and Visualisation in R
14.1 INTRODUCTION
In the previous unit, you learnt the basic concepts of R programming.
This unit explains how to read and analyse data in R from various file types,
including CSV, Excel, binary, XML and JSON. It also discusses how to extract
and work with data from databases and from the web. The unit then explains
data cleaning and pre-processing in R in detail. The later sections explore
visualisation in R. Various types of graphs and charts, including bar charts,
box plots, histograms, line graphs and scatterplots, are discussed.
14.2 OBJECTIVES
After going through this Unit, you will be able to:
• explain the various file types that can be processed for data analysis
in R, and their interfaces;
• read, write and analyse data in R from different types of files, including
CSV, Excel, binary, XML and JSON;
• extract and use data from databases and the web for analysis in R;
• explain the steps involved in data cleaning and pre-processing using R;
• visualise data using various types of graphs and charts in R and
explain their usage.
Basics of R Programming
14.3 READING DATA FROM FILES
In R, you can read data from files stored outside the R environment. You
can also write data to files that the operating system stores and can
access later. R can read from and write to a wide range of file formats,
including CSV, Excel, binary and XML.
14.3.1 CSV Files
Input as CSV File:
A CSV file is a text file in which column values are separated by commas.
For example, you can create data with the name, programme and phone
number of students. By copying and pasting this data into Windows
Notepad, you can create the CSV file. Using Notepad's Save As option,
save the file as input.csv.
Reading a CSV File:
Function used to read a CSV file: read.csv()
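As an illustration, the following minimal sketch writes a small data frame to input.csv and reads it back with read.csv(). The student records are invented for this example, and the file is placed in R's temporary folder (an assumption made so that the sketch runs anywhere) rather than being typed in Notepad.

```r
# Hypothetical student records (names, programmes and phone numbers are made up)
students <- data.frame(
  name      = c("Asha", "Ravi", "Meena"),
  programme = c("BCA", "MCA", "BCA"),
  phone     = c("9000000001", "9000000002", "9000000003"),
  stringsAsFactors = FALSE
)

# Write the data to input.csv; row.names = FALSE keeps R's row numbers
# out of the file
csv_path <- file.path(tempdir(), "input.csv")
write.csv(students, csv_path, row.names = FALSE)

# read.csv() returns a data frame; colClasses keeps the phone column as text
# instead of letting R convert it to a number
data <- read.csv(csv_path, colClasses = c(phone = "character"))

print(data)
print(is.data.frame(data))
print(nrow(data))   # 3 rows, one per student

# Once loaded, the data can be analysed like any other data frame
bca <- subset(data, programme == "BCA")
print(bca)
```

Reading the phone numbers as character values is a design choice for this sketch: they are identifiers, not quantities, so arithmetic on them would be meaningless.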
Microsoft Excel is the most widely used spreadsheet tool, and it stores data
in files with the .xls or .xlsx extension. Using various Excel-specific
packages, R can read directly from these files. XLConnect, xlsx and gdata are
a few examples of such packages. The xlsx package also allows R to write to an
Excel file.
Data Interfacing & Visualization in R
Install xlsx Package
Syntax:
writeBin(object, con)
readBin(con, what, n )
where,
• con is the connection object used to read or write the binary file.
• object is the R object to be written to the binary file.
• what is the mode of the values to be read, such as character,
integer, etc.
• n is the maximum number of values (not bytes) to read from the
binary file.
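The two functions can be tried together as in the sketch below, which writes an integer vector to a binary file and reads it back. The file name values.bin and the sample values are assumptions made for illustration.

```r
# Hypothetical file in R's temporary folder
bin_path <- file.path(tempdir(), "values.bin")
values <- c(10L, 20L, 30L, 40L, 50L)

# Open a connection in "wb" mode (write, binary) and write the vector
con <- file(bin_path, "wb")
writeBin(values, con)
close(con)

# Re-open in "rb" mode (read, binary); what gives the type of the stored
# values and n the maximum number of values to read
con <- file(bin_path, "rb")
read_back <- readBin(con, what = integer(), n = 5)
close(con)

print(read_back)
```

Note that readBin() must be told the type of the stored values through what, because a binary file carries no description of its own contents.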
install.packages("XML")
14.3.6 Databases
RMySQL Package
R has a package called "RMySQL", available from CRAN, that allows you
to connect to a MySQL database natively. It is not part of the base R
installation; the following command will install it in the R environment.
install.packages("RMySQL")
Connecting R to MySQL
Many websites make data available for users to consume. The World
Health Organization (WHO), for example, provides reports on health and
medical information in CSV, txt, and XML formats. You can
programmatically extract certain data from such websites using R
applications. "RCurl," "XML," and "stringr" are some R packages that
are used to scrape data from the web. They are used to connect to URLs,
detect required file links, and download the files to the local
environment.
Install R Packages
For processing the URLs and links to the files, the following packages
are necessary.
install.packages("RCurl")
install.packages("XML")
install.packages("stringr")
install.packages("plyr")
14.5 VISUALIZATION IN R
In the previous section, we discussed obtaining input data from different
types of sources. This section explains various types of graphs that can be
drawn using R. Please note that only selected types of graphs are presented
here.
Syntax:
barplot(H,xlab,ylab,main, names.arg,col)
where,
• H is a vector or matrix containing the numeric values used in the bar chart.
• xlab is the label for the x axis.
• ylab is the label for the y axis.
• main is the title of the bar chart.
• names.arg is a vector of names that appear beneath each bar.
• col is used to colour the graph's bars.
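The parameters listed above can be combined as in the following sketch. The revenue figures and month names are invented for illustration, and the chart is written to a PDF file in R's temporary folder (an assumption made so that the code also runs without a display). In an interactive session you can omit the pdf() and dev.off() calls.

```r
# Hypothetical monthly revenue figures
H <- c(7, 12, 28, 3, 41)
M <- c("Mar", "Apr", "May", "Jun", "Jul")

# Open a PDF graphics device so the chart is saved to a file
out_file <- file.path(tempdir(), "barchart.pdf")
pdf(out_file)

# names.arg must have one label per bar; barplot() returns the
# x positions of the bar midpoints
bar_mids <- barplot(H,
                    names.arg = M,
                    xlab = "Month",
                    ylab = "Revenue",
                    main = "Revenue chart",
                    col  = "steelblue")

# Close the device to finish writing the file
invisible(dev.off())
```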
More parameters can be added to the bar chart to increase its capabilities.
The title is added using the main parameter. Colours are added to the bars
using the col parameter. To express the meaning of each bar, names.arg
is given a vector with the same number of values as the input vector.
Figure 14.18: Function for plotting Bar chart with labels and
colours
14.5.3 Histograms
Syntax:
hist(v,main,xlab,xlim,ylim,breaks,col,border)
where,
• v is a vector that contains the numeric values for which the
histogram is to be drawn.
• main is the title of the chart.
• col controls the colour of the bars.
• border controls the border colour of each bar.
• xlab is the label for the x-axis.
• xlim specifies the range of the x-axis.
• ylim specifies the range of the y-axis.
• breaks controls how the values are divided into bins, and hence the
width of each bar. It can be a suggested number of bins or a vector of
break points.
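A minimal sketch of a histogram using these parameters is given below. The values in v are invented for illustration, and the output file name is an assumption. As with any plot, the pdf() and dev.off() calls can be dropped in an interactive session.

```r
# Hypothetical measurements
v <- c(9, 13, 21, 8, 36, 22, 12, 41, 31, 33, 19)

out_file <- file.path(tempdir(), "histogram.pdf")
pdf(out_file)

# breaks = 5 is only a suggestion for the number of bins; R picks
# "pretty" break points close to that number
h <- hist(v,
          main   = "Histogram of v",
          xlab   = "Weight",
          xlim   = c(0, 50),
          ylim   = c(0, 5),
          breaks = 5,
          col    = "lightgreen",
          border = "darkgreen")

invisible(dev.off())

# hist() also returns the computed bins, which can be inspected
print(h$breaks)
print(h$counts)
```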
A graph that uses line segments to connect a set of points is known as a
line graph. These points are sorted according to the value of one of their
coordinates (typically the x-coordinate). Line charts are commonly used
to identify trends in data.
A line graph is created using R's plot() function.
Syntax:
plot(v,type,col,xlab,ylab,main)
where,
• v is a vector that contains the numeric values.
• type takes the values "p", "l" or "o". The value "p" is used to draw only
points, "l" is used to draw only lines, and "o" is used to draw both
points and lines.
• xlab specifies the label for the x axis.
• ylab specifies the label for the y axis.
• main is used to specify the title of the chart.
• col is used to specify the colour of the points and/or the lines.
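The sketch below draws a line chart with plot() and then adds a second line using the base R function lines(), which draws on an existing plot. Both data series and the output file name are assumptions made for illustration.

```r
# Hypothetical rainfall values for two years
v <- c(7, 12, 28, 3, 41)
t <- c(14, 7, 6, 19, 3)

out_file <- file.path(tempdir(), "linechart.pdf")
pdf(out_file)

# type = "o" draws both the points and the connecting lines
plot(v, type = "o", col = "red",
     xlab = "Month", ylab = "Rainfall", main = "Rainfall chart")

# lines() adds another series to the chart that plot() has already drawn
lines(t, type = "o", col = "blue")

invisible(dev.off())
```

Because plot() fixes the axis ranges from the first series, a later series with larger values may be cut off. Passing ylim to plot() avoids this.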
Figure 14.27: A Line Chart with multiple lines for data of Figure
14.26
14.5.4 Scatterplots
Syntax:
plot(x, y, main, xlab, ylab, xlim, ylim, axes)
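Here x and y hold the horizontal and vertical coordinates, main is the title, xlab and ylab are the axis labels, xlim and ylim give the axis ranges, and axes indicates whether the axes are drawn. A minimal sketch using the mtcars data set that ships with R is shown below. The choice of the columns wt and mpg, the axis limits and the output file name are assumptions made for illustration.

```r
# mtcars is a built-in data set; wt is car weight and mpg is mileage
out_file <- file.path(tempdir(), "scatterplot.pdf")
pdf(out_file)

plot(x = mtcars$wt, y = mtcars$mpg,
     main = "Weight vs mileage",
     xlab = "Weight",
     ylab = "Mileage",
     xlim = c(1, 6),
     ylim = c(10, 35),
     axes = TRUE)

invisible(dev.off())

# The correlation coefficient summarises the pattern in the plot;
# it is negative here because heavier cars tend to have lower mileage
r <- cor(mtcars$wt, mtcars$mpg)
print(r)
```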
Scatterplot Matrices
A scatterplot matrix is used when there are more than two variables
and you want to examine the correlation between each variable and the
others. To make scatterplot matrices, we use the pairs() function.
Syntax:
pairs(formula, data)
where,
• formula represents the set of variables that are plotted in pairs.
• data is the data set from which the variables are taken.
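For example, the following sketch draws a scatterplot matrix for four columns of the built-in mtcars data set. The selection of columns and the output file name are assumptions made for illustration.

```r
out_file <- file.path(tempdir(), "scatterplot_matrix.pdf")
pdf(out_file)

# The formula lists the variables on the right-hand side of ~;
# every variable is plotted against every other one
pairs(~ wt + mpg + disp + cyl, data = mtcars,
      main = "Scatterplot matrix")

invisible(dev.off())
```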
14.6 SUMMARY
In this unit you have gone through various file types that can be processed for
data analysis in R and discussed their interfaces. R can read and write a
variety of file types outside the R environment, including CSV, Excel, binary,
XML and JSON. Further, R can readily connect to various relational databases,
such as MySQL, Oracle and SQL Server, and retrieve records as a data frame
that can be modified and analysed with R's packages and functions. Data can
also be programmatically extracted from websites using R programs. "RCurl",
"XML" and "stringr" are some R packages that are used to scrape data from the
web. The unit also explains data cleaning and pre-processing, which is the
process of identifying, correcting and removing incorrect raw data. The steps
involved in cleaning and preparing data are familiarisation with the dataset,
checking the data for structural errors and data irregularities, and deciding
how to deal with missing values. These steps are widely regarded as best
practice. The unit finally explores visualisation in R. Various types of graphs
and charts, including bar charts, box plots, histograms, line graphs and
scatterplots, can be used to visualise data effectively. The unit explained the
usage and syntax of each of them with illustrations.
14.7 ANSWERS
Check your progress 1
1. install.packages("rjson")
library(rjson)
2. rb mode opens the file in the binary format for reading and wb mode
opens the file in the binary format for writing.
3. The checklist points used for cleaning/ preparing data:
Check for data irregularities: You may check for the invalid values and
outliers.
Decide on how to deal with missing values: Either delete the observations
if they do not provide any meaningful insight into the data, or impute
the missing values with logical substitutes, such as the mean or median
of the observations.
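The two options for missing values can be sketched in R as follows. The data frame marks and its scores are invented for illustration; na.omit(), is.na() and mean() are base R functions.

```r
# Hypothetical scores with two missing values (NA)
marks <- data.frame(
  student = c("A", "B", "C", "D", "E"),
  score   = c(72, NA, 65, 88, NA)
)

# Count the missing values before deciding what to do
print(sum(is.na(marks$score)))   # 2

# Option 1: delete the observations that contain missing values
complete <- na.omit(marks)

# Option 2: impute the missing scores with the mean of the observed ones
imputed <- marks
observed_mean <- mean(marks$score, na.rm = TRUE)   # (72 + 65 + 88) / 3 = 75
imputed$score[is.na(imputed$score)] <- observed_mean

print(complete)
print(imputed)
```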
Check your progress 2
1. A scatter plot is a chart used to plot the correlation between two or more
variables at the same time.
2. We use a histogram to plot the distribution of a continuous variable,
while we can use a bar chart to plot the distribution of a categorical
variable.
3. When you are trying to show “relationship” between two variables, you
will use a scatter plot or chart. When you are trying to show
“relationship” between three variables, you will have to use a bubble
chart.