Data Visualization in R Sem-III 2021 PDF

Data Visualization in R

Compiled by Dr. Vanita Joshi


R offers a set of built-in functions and libraries to build
visualizations and present data. Before the technical implementation
of a visualization, it is important to select the right chart type.
There are four basic presentation types:
• Comparison
• Composition
• Distribution
• Relationship
To determine which of these is best suited to your data, the
following questions need to be answered:
• How many variables do you want to show in a single chart?
• How many data points will you display for each variable?
• Will you display values over a period of time, or among items
or groups?
R Graphics
• R supports multiple graphics functions, viz. plot(), dotchart(), hist(),
stem(), pie(), barplot(), boxplot(), qqplot(), etc. (scatter plots are
drawn with plot()).

• plot() function: The plot() function forms the foundation for much
of R’s base graphing operations, serving as the vehicle for producing
many different kinds of graphs. plot() is a generic function, or a
placeholder for a family of functions. The function that is actually
called depends on the class of the object on which it is called.
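
The class-based dispatch described above can be sketched as follows. This is a minimal illustration using base R only; the vectors x, f and d are made-up examples, not data from the slides.

```r
# plot() is generic: the method that runs depends on the class of its argument.
x <- c(1, 2, 3, 4, 5)                    # numeric vector
f <- factor(c("a", "b", "a", "c", "a"))  # factor
d <- density(c(2, 3, 5, 6, 7))           # object of class "density"

plot(x)  # numeric -> index plot drawn by plot.default()
plot(f)  # factor  -> bar chart of level counts drawn by plot.factor()
plot(d)  # density -> density curve drawn by plot.density()

# class() shows which family member plot() will pick for each object
class(x)
class(f)
class(d)
```

The same call, plot(), therefore produces three different kinds of graph without any change in syntax.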

R Introduction 4
Let’s see what happens when we call plot()
with an X vector and a Y vector, which are
interpreted as a set of pairs in the (x,y) plane.

>plot(c(1,2,3),c(1,2,4))

x<-c(2,3,5,6,7)
y<-c(5,3,7,3,2)

qqplot(x,y)
Scatter Plot
plot(iris$Sepal.Length, iris$Sepal.Width, col = iris$Species)
legend("topright", legend = levels(iris$Species), col = 1:3, pch = 21)

Histogram
A histogram needs data to plot, so either import a dataset or use one
that is built into R. The example below uses 10 random values drawn from
a normal distribution with mean 45 and standard deviation 5.
hist(rnorm(10,45,5))

Contd..

• You can make a histogram with the hist() function, which computes a
histogram of the given data values. Put the name of your dataset
between the parentheses of this function, like this:
> hist(AirPassengers)

Colored Histograms
• R offers several quick and easy ways to customize the appearance of a
histogram while still using the hist() function.
> hist(AirPassengers, main="Histogram for Air Passengers",
xlab="Passengers", border="blue", col="yellow", xlim=c(100,700))

Bar Chart
> B <- c(3, 2, 25, 37, 22, 34, 19)
> barplot(B, col="darkgreen")

• Bar Chart
• First, we set up a vector of letters. Then we count them using the
table() command, and then we plot the counts. The table() command
creates a simple table of counts of the elements in a data set.
>H <- c("A","B","B","C","C","C","A","C","B","D","D","A","A","A")
Now we count the elements using the table() command, as follows:
>count <- table(H)
>count
H
A B C D
5 3 4 2
>barplot(count)

Contd..
>data <- structure(list(W= c(1L, 3L, 6L, 4L, 9L), X = c(2L, 5L, 4L, 5L,
12L), Y = c(4L, 4L, 6L, 6L, 16L), Z = c(3L, 5L, 6L, 7L, 6L)), .Names =
c("W", "X", "Y", "Z"), class = "data.frame", row.names = c(NA, -5L))
>attach(data)
>print(data)

  W  X  Y Z
1 1  2  4 3
2 3  5  4 5
3 6  4  6 6
4 4  5  6 7
5 9 12 16 6

colours <- c("red", "orange", "blue", "yellow", "green")
> barplot(as.matrix(data), main="My Barchart", ylab = "Numbers",
cex.lab = 1.5, cex.main = 1.4, beside=TRUE, col=colours)

Pie Chart
Create a simple pie chart using a vector.

>B <- c(2, 4, 5, 7, 12, 14, 16)


>pie(B)

Colored Pie Chart
>B <- c(2, 4, 5, 7, 12, 14, 16)
>pie(B, main="My Piechart", col=rainbow(length(B)),
labels=c("Mon","Tue","Wed","Thu","Fri","Sat","Sun"))

Box Plot
boxplot(mpg~cyl, data=mtcars, main="Car Mileage Data",
xlab="Number of Cylinders", ylab="Miles Per Gallon")

Density Plot
d <- density(rnorm(10,50,5))
plot(d)
polygon(d, col="red", border="blue")

Graphs using package ggplot2
Prerequisite for ggplot2
>install.packages("tidyverse")
## install.packages("dplyr")
Why ggplot2?
• Consistent underlying grammar of graphics (Wilkinson, 2005)
• Plot specification at a high level of abstraction
• Very flexible
• Theme system for polishing plot appearance
• Mature and complete graphics system
• Many users, active mailing list
Grammar of Graphics
ggplot2 is a powerful and flexible R package, implemented by Hadley Wickham, for
producing elegant graphics.
• The concept behind ggplot2 divides a plot into three fundamental parts: Plot =
Data + Aesthetics + Geometry.
• The principal components of every plot can be defined as follows:
Data is a data frame.
Aesthetics indicate the x and y variables. They can also be used to control the
color, size or shape of points, the height of bars, etc.
Geometry defines the type of graphic (histogram, box plot, line plot, density plot,
dot plot, ...).
• There are two major functions in the ggplot2 package: qplot() and ggplot().
• qplot() stands for quick plot and can be used to produce simple plots easily.
• ggplot() is more flexible and robust than qplot() for building a plot piece by
piece.
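
The "Plot = Data + Aesthetics + Geometry" idea can be sketched with ggplot() as follows. The data frame df and its values are invented purely for illustration. Only ggplot() is shown, since qplot() is deprecated in recent ggplot2 releases.

```r
library(ggplot2)

# Data: a small illustrative data frame (values are made up for this sketch)
df <- data.frame(x = c(1, 2, 3, 4, 5), y = c(2, 4, 5, 4, 6))

# Plot = Data + Aesthetics + Geometry
p <- ggplot(data = df,                      # Data
            mapping = aes(x = x, y = y)) +  # Aesthetics
  geom_point()                              # Geometry

# Further layers are added piece by piece with `+`
p <- p + geom_line()
print(p)
```

Each `+` adds one layer, so the points and the connecting line are specified independently but share the same data and aesthetics.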
What comes under Grammar of Graphics??
Scatter Plot
When to use: A scatter plot is used to see the relationship between two
continuous variables.
Using the Big Mart dataset, if we want to visualize the items as per their cost
data, we can use a scatter plot of two continuous variables,
namely Item_Visibility and Item_MRP, as shown below.

First, let's install the "ggplot2" package and load the library.
>install.packages("ggplot2")
>library(ggplot2)
Import the Big Mart dataset into R (the code below assumes it is stored in a
data frame called my_data).
Here is the R code for a simple scatter plot using the function ggplot() with geom_point().
>ggplot(data = my_data, mapping = aes(x = Item_Visibility, y = Item_MRP)) +
geom_point() + geom_smooth(se = TRUE)
Exercise for students:
The mpg data frame in ggplot2 contains observations collected
by the US Environmental Protection Agency on 38 models of car.
?mpg
• What does the relationship between engine size and fuel efficiency look
like? Is it positive? Negative? Linear? Nonlinear?
Answer: ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy))
• Make a scatterplot of hwy vs cyl.
• What happens if you make a scatterplot of class vs drv? Why is the plot not
useful?
A third variable can be shown in the same chart, say a categorical variable (Item_Type)
that gives the characteristic (item type) of each data point. In the chart below, each
category of Item_Type is drawn in a different color.
>ggplot(data=my_data)+geom_point(mapping=aes(x=Item_Visibility,
y=Item_MRP,color=Item_Type))
Exercise for students:
Question 1: Display a scatter plot of engine size against fuel efficiency
(hwy and cty), and include the number of cylinders as well.
Categorize the points by type of car (class) in the mpg data frame and
give the plot the title "My Plot". Use aesthetics (color, size) for the class.
>ggplot(data=my_data) +
geom_point(mapping=aes(x=Item_Visibility, y=Item_MRP, size=Item_Type))
>ggplot(data=my_data) +
geom_point(mapping=aes(x=Item_Visibility, y=Item_MRP, shape=Item_Type))
We can even make it more visually clear by creating separate scatter plots for each separate Item_Type as shown below.
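
A minimal sketch of such a faceted plot follows. Because the Big Mart data is not bundled with R, the built-in mpg data frame is used as a stand-in, with its class column playing the role of Item_Type; that substitution is an assumption of this example, not the original slide's code.

```r
library(ggplot2)

# facet_wrap() draws one panel per level of a categorical variable.
# Here `class` stands in for Item_Type from the Big Mart data.
p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ class, nrow = 2)
print(p)
```

With the Big Mart data loaded as my_data, the equivalent call would map Item_Visibility and Item_MRP to x and y and use facet_wrap(~ Item_Type).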
Exercise:
1. What happens if you facet on a continuous variable?
2. What do the empty cells in plot with facet_grid(drv ~ cyl) mean?
How do they relate to this plot?
3. What is the outcome of each of the following?
>ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ .)
>ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(. ~ cyl)
4. Read ?facet_wrap. What does nrow do? What does ncol do? What
other options control the layout of the individual panels? Why
doesn’t facet_grid() have nrow and ncol arguments?
Exercise:
• Use the midwest data set from the ggplot2 package and draw a scatter plot
to correlate area with population, segregated based on population density
and states.
>library(ggplot2)
>?midwest
• Label the plot well.
Exercise: Draw the following plot using mtcars dataset in R.
Histogram
• When to use: A histogram is used to plot a continuous variable.
It breaks the data into bins and shows the frequency distribution
of these bins. We can always change the bin size and see the
effect it has on the visualization.
• From the Big Mart dataset, if we want to know the count of
items on the basis of their cost, we can plot a histogram of the
continuous variable Item_MRP as shown below.
• Here is the R code for a simple histogram using the function
ggplot() with geom_histogram().
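
A minimal sketch of a ggplot2 histogram is given here. It uses the built-in mpg data frame and its hwy column as a stand-in for Item_MRP (an assumption, since the Big Mart data is not bundled with R); the bin widths are arbitrary illustrative choices.

```r
library(ggplot2)

# Histogram of a continuous variable; each bar covers 2 units of hwy
p <- ggplot(data = mpg, mapping = aes(x = hwy)) +
  geom_histogram(binwidth = 2, fill = "steelblue", color = "white") +
  labs(x = "Highway miles per gallon", y = "Count")
print(p)

# A larger binwidth gives fewer, coarser bins for the same data
p_coarse <- ggplot(data = mpg, mapping = aes(x = hwy)) +
  geom_histogram(binwidth = 5, fill = "steelblue", color = "white")
print(p_coarse)
```

Comparing the two plots shows the effect of bin size mentioned above: the total count is unchanged, only its grouping differs.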
Exercise:
• Use the diamonds dataset, which comes with ggplot2 and contains information
about ~54,000 diamonds, including the price, carat, color, clarity,
and cut of each diamond.
?diamonds
Q. Display the total number of diamonds in the diamonds dataset,
grouped by cut.
Q. Add the aesthetics color and fill, with clarity as a third variable, to the
above graph.
Bar and Stack Bar Chart
When to use: Bar charts are recommended when you want to
plot a categorical variable or a combination of continuous and
categorical variables.
From the Big Mart dataset, if we want to know the number of marts
established in a particular year, a bar chart would be the most
suitable option, using the variable Establishment Year as shown
below.
Here is the R code for a simple bar plot using the function ggplot()
for a single variable.
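
A minimal sketch of a count-based bar chart follows. The built-in mpg data frame stands in for the Big Mart data, with its year column playing the role of Establishment Year; this substitution is an assumption of the example.

```r
library(ggplot2)

# geom_bar() counts the rows in each category by default (stat = "count").
# factor(year) makes ggplot2 treat the numeric year as a category.
p <- ggplot(data = mpg, mapping = aes(x = factor(year))) +
  geom_bar(fill = "steelblue") +
  labs(x = "Model year", y = "Number of cars")
print(p)
```

Unlike the survey example that follows, no y variable is supplied here, because the bar heights are computed from the data.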
Simple Bar Chart (survey on liking of fruits)
>s1<- data.frame(fruit=c("Apple", "Banana", "Grapes", "Kiwi", "Orange",
"Pears"), people=c(40, 50, 30, 15, 35, 20))
>s1
>ggplot(s1, aes(x=fruit, y=people)) + geom_bar(stat="identity")
Coloured & labelled Barplot
• ggplot(s1, aes(x=fruit, y=people, fill=fruit)) + geom_bar(stat="identity") +
ggtitle("Favorite fruit survey") + xlab("Fruits") + ylab("Number of People")
• ggplot(s1, aes(x=fruit, y=people, fill=fruit)) + geom_bar(stat="identity") +
geom_text(aes(label=people), vjust=1.5, colour="white", size=3.5) + theme_bw()
Stacked Bar Chart
• A stacked bar chart is an advanced version of the bar chart, used for
visualizing a combination of categorical variables.
• From the Big Mart dataset, if we want to know the count of outlets on the
basis of two categorical variables, its type (Outlet Type) and location (Outlet
Location Type), a stacked bar chart visualizes the scenario in the most
useful manner.
• Here is the R code for a simple stacked bar chart using the function
ggplot().
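
A minimal sketch of a stacked bar chart is shown here, again using the built-in mpg data frame as a stand-in (an assumption): class plays the role of Outlet Location Type and drv the role of Outlet Type.

```r
library(ggplot2)

# Mapping a second categorical variable to `fill` splits each bar into
# segments; geom_bar() stacks the segments by default (position = "stack").
p <- ggplot(data = mpg, mapping = aes(x = class, fill = drv)) +
  geom_bar() +
  labs(x = "Type of car", y = "Number of cars", fill = "Drive train")
print(p)
```

Passing position = "dodge" to geom_bar() would place the segments side by side instead of stacking them.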
Exercise: Draw the given plot using
mtcars dataset.
Box Plot
• When to use: Box plots are used to plot a combination of categorical and
continuous variables. This plot is useful for visualizing the spread of the data
and detecting outliers. It shows five summary statistics: the
minimum, the 25th percentile, the median, the 75th percentile and the
maximum.
• From the Big Mart dataset, if we want to identify each outlet’s detailed item
sales, including minimum, maximum and median numbers, a box plot can be
helpful. In addition, it shows the outlying item sales values for each
outlet, as in the chart below.
• The black points are outliers. Outlier detection and removal is an essential
step of successful data exploration.
• Here is the R code for a simple box plot using the function ggplot() with
geom_boxplot().
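
A minimal sketch of a ggplot2 box plot follows, using the built-in mpg data frame as a stand-in for the Big Mart data (an assumption): class plays the role of the outlet and hwy the role of item sales. Note that geom_boxplot() by default draws whiskers only up to 1.5 times the interquartile range from the box and plots anything beyond that as individual outlier points.

```r
library(ggplot2)

# One box per category of `class`, summarizing the continuous variable `hwy`
p <- ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot(fill = "steelblue", outlier.colour = "black") +
  labs(x = "Type of car", y = "Highway miles per gallon")
print(p)

# The statistics behind each box (whisker ends, quartiles, median)
layer_data(p, 1)[, c("ymin", "lower", "middle", "upper", "ymax")]
```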
Area Chart
• When to use: An area chart is used to show continuity across a variable
or data set. It is very similar to a line chart and is commonly used
for time series plots. It is also used to plot continuous
variables and analyze the underlying trends.
• From the Big Mart dataset, when we want to analyze the trend of item
outlet sales, an area chart can be plotted as shown below. It shows the count
of outlets on the basis of sales.
• Here is the R code for a simple area chart showing the continuity of Item
Outlet Sales using the function ggplot() with geom_area().
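
A minimal sketch of such an area chart is given here. It assumes the built-in mpg data frame as a stand-in for the Big Mart data, with hwy in place of Item Outlet Sales, and it assumes the counts are obtained by binning (stat = "bin"); the number of bins is an arbitrary choice.

```r
library(ggplot2)

# stat = "bin" counts the observations in each interval of hwy, and
# geom_area() fills the region under the resulting count curve.
p <- ggplot(data = mpg, mapping = aes(x = hwy)) +
  geom_area(stat = "bin", bins = 30, fill = "steelblue") +
  labs(x = "Highway miles per gallon", y = "Count")
print(p)
```

For a time series, the date would go on the x axis and the measured value on the y axis, with geom_area() used without binning.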
Correlogram
• When to use: A correlogram is used to examine the level of correlation among
the variables available in the data set. The cells of the matrix can be shaded
or colored to show the correlation value.
• The darker the color, the stronger the correlation between the variables.
Positive correlations are displayed in blue and negative correlations in red.
Color intensity is proportional to the correlation value.
• From the Big Mart dataset, let’s check the correlation between item cost, weight
and visibility along with outlet establishment year and outlet sales in the plot
below.
• In our example, we can see that item cost and outlet sales are positively
correlated, while item weight and visibility are negatively correlated.
>install.packages("corrgram")
>library(corrgram)
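
A minimal sketch of a correlogram drawn with the corrgram package follows. It assumes the package is installed and uses a few numeric columns of the built-in mtcars data set as a stand-in for the Big Mart variables; the choice of columns and panel functions is illustrative only.

```r
library(corrgram)

# mtcars is a stand-in here; these five columns are all numeric
num_vars <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]

# The correlation matrix that the plot visualizes
cm <- cor(num_vars)
print(round(cm, 2))

corrgram(num_vars, order = TRUE,
         lower.panel = panel.shade,  # shaded cells: blue positive, red negative
         upper.panel = panel.pie,    # pie size shows correlation strength
         text.panel = panel.txt,     # variable names on the diagonal
         main = "Correlogram of selected mtcars variables")
```

Printing the matrix alongside the plot makes it easy to check that darker cells correspond to correlations further from zero.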
Class Exercise (ggplot2)
1. Which variables in mtcars are categorical? Which variables are continuous?
(Hint: type ?mtcars)
2. What geom would you use to draw a line chart? A boxplot? A histogram? An
area chart?
3. Map a continuous variable to color, size and shape. How do these aesthetics
behave differently for categorical versus continuous variables?
4. What happens if the same variable is mapped to multiple aesthetics?
5. What will be the outcome of:
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
geom_point() + geom_smooth(se = FALSE)
6. What is the purpose of the se argument in geom_smooth()?
7. Create different types of graphs using ggplot2 (at least 8).
Sources: notespg
