IDS Unit-5

This document provides an overview of using the R programming language for creating charts and graphs, emphasizing its graphical capabilities for data representation. It details various types of graphs, such as pie charts and line graphs, and explains the functions and parameters used to create them, including examples. Additionally, it discusses advanced features like overlaying plots and customizing colors for better visualization.

Uploaded by

upender
Copyright
© All Rights Reserved

UNIT-5

CHARTS AND GRAPHS


INTRODUCTION
Every sector now generates large volumes of data, and that data has to be interpreted. To gain
meaningful insights, it is usually preferable to depict data through charts and graphs rather than
to scan large Excel sheets. The R programming language is widely used in statistics and data
analytics to represent data graphically.
R includes simple techniques for converting data into visual forms such as graphs and charts, and
it supports a rich set of packages and functions for building them from an input data set.
The most commonly used graphs in R are scatter plots, bar plots (bar charts), box plots, mosaic
plots, line graphs, pie charts, dot charts, coplots and histograms. R graphs support both
two-dimensional and three-dimensional plots for exploratory data analysis. Base R functions such
as plot(), barplot() and pie() are used to develop graphs, and packages such as ggplot2 support
advanced graphing functionality.
CREATING GRAPHS
R is widely used for its extensive techniques for the graphical interpretation of data, which are
of great importance to analysts. The primary styles are: dot plots, density plots (histograms and
kernel density plots), line graphs, bar graphs (simple, grouped and stacked), pie charts (simple,
exploded and 3D), box plots (simple, notched and violin plots), bag plots and scatter plots (simple
with fit lines, scatter-plot matrices, high-density plots and 3-D plots). The foundational
function for creating graphs is plot(). This section covers how to build a graph, from adding
lines and points to attaching a legend.
The Workhorse of R Base Graphics: The plot() Function
The plot() function draws points (markers) in a diagram. It forms the foundation for much of R's
base graphing operations and serves as the vehicle for producing many different kinds of graphs.
plot() is a generic function, that is, a placeholder for a family of functions: the function that
is actually called depends on the class of the object on which it is called. The basic syntax to
create a line chart in R is
plot(v, type, col, xlab, ylab)
The basic syntax to give a title to a chart in R is
title(main = "", col.main)
Following is the description of the parameters used:
✓ v is a vector containing the numeric values.
✓ type takes the value "p" to draw only the points, "l" to draw only the lines and "o" to draw
both points and lines.
✓ xlab is the label for the x axis.
✓ ylab is the label for the y axis.
✓ main is the title of the chart.
✓ col gives the color of both the points and the lines; col.main gives the color of the title.
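Because plot() is generic, the same call produces different graphs for objects of different classes. The following is a minimal sketch of that behavior; the data values and the names fit and dens are illustrative and do not come from the original notes.

```r
# plot() chooses a method based on the class of its first argument.
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)

fit  <- lm(y ~ x)     # an object of class "lm"
dens <- density(y)    # an object of class "density"

class(x)      # "numeric": plot(x, y) uses the default method
class(fit)    # "lm": plot(fit) would use the method for linear models
class(dens)   # "density": plot(dens) uses the method for density objects

plot(x, y)    # default method: a scatter plot of the points
plot(dens)    # density method: a smooth curve
```

The two plot() calls look alike, yet they draw different kinds of graph because the class of the first argument differs.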

Examples of the plot() function:

# Draw one point in the diagram, at x = 1 and y = 3
> plot(1, 3)

# Define the cars vector with 5 values
> cars <- c(1, 3, 6, 4, 9)
# Graph the cars vector with all defaults
> plot(cars)
The default value of type is "p" (points).

Call the plot() function with an X vector and a Y vector, which are interpreted as a set of
pairs in the (x, y) plane.
> x <- c(1,2,3)
> y <- c(1,2,4)
> plot(x, y)
This will cause a window to pop up, plotting the points (1,1), (2,2), and (3,4). This is a very
plain-Jane graph.
# Draw dots in a sequence, on both the x-axis and the y-axis, using the : operator
> plot(1:10)

# Use cex=number to change the size of the points (1 is the default, while 0.5 means 50% smaller
# and 2 means 100% larger)
> plot(1:10, cex=2)

# Use pch with a value from 0 to 25 to change the point shape
> plot(1:10, pch=25, cex=1)

# To draw a line, use the type parameter with the value "l"; it connects all the points
> plot(1:10, type="l")
# Define the cars vector with 5 values
> cars <- c(1, 3, 6, 4, 9)
# Graph cars using blue points with lines
> plot(cars, type="o", col="blue")
# Create a title with a red, bold/italic font
> title(main="Autos",col.main="red", font.main=4)

# Create the data for the chart.
> v <- c(7,12,28,3,41)
# Give the chart file a name (png() writes a PNG file, so use a .png extension).
> png(file = "line_chart1.png")
# Plot the line chart.
> plot(v,type = "o")
# Save the file.
> dev.off()
RStudioGD
2
> plot(c(-3,3), c(-1,5), type = "n", xlab="x", ylab="y")
# This draws axes labeled x and y. The horizontal (x) axis ranges from -3 to 3. The vertical (y)
# axis ranges from -1 to 5. The argument type="n" means that there is nothing in the graph itself.

Overlaying Plots:- Each call to plot() starts a new graph in the same window and replaces the
previous one. To compare results on a single graph, overlay them instead: the lines( ) and
points( ) functions add lines and points to the existing plot.
> x <- seq(-pi, pi, 0.1)
> plot(x, sin(x), main="Overlaying Graphs", type="l", col="blue")
> lines(x, cos(x), col="red")
> legend("topleft", c("sin(x)","cos(x)"), fill=c("blue","red"))
Abline function:- This function simply draws a straight line, with the function's arguments
treated as the intercept and slope of the line.
> x <- c(1,2,3)
> y <- c(1,3,8)
> plot(x,y,col="red", pch="+")
> lmout <- lm(y ~ x)
> abline(lmout)
After the call to plot(), the graph will simply show the three points, along with the x- and
y-axes with tick marks. The call to abline() then adds a line to the current graph. Now, which
line is this?
The result of the call to the linear-regression function lm( ) is a class instance containing the
slope and intercept of the fitted line, as well as various other quantities that don't concern us
here. We've assigned that class instance to lmout. The slope and intercept will now be in
lmout$coefficients, and abline() reads them from there, so the line drawn is the fitted
regression line.
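To make the link between the fitted model and the drawn line concrete, the sketch below extracts the coefficients and passes them to abline() explicitly. The variable name b and the blue color are illustrative choices, not part of the original example.

```r
# Fit the regression line for the three points used in the abline() example.
x <- c(1, 2, 3)
y <- c(1, 3, 8)
lmout <- lm(y ~ x)

# coef() returns the same values as lmout$coefficients:
# the intercept first, then the slope.
b <- coef(lmout)
b

plot(x, y, col = "red", pch = "+")
# Passing the intercept (a) and the slope (b) explicitly
# draws the same line as abline(lmout).
abline(a = b[1], b = b[2], col = "blue")
```

For these three points the fitted intercept is -3 and the slope is 3.5, so the line drawn is y = -3 + 3.5x.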

Some of the coloring functions in graphs are:

1. colors( ): Returns the built-in color names which R knows about.
   > col <- colors()[234]
   > col
   [1] "gray81"
2. rgb( ): Creates colors corresponding to the given intensities (between 0 and max) of the red,
   green and blue primaries. It returns the hex code of the color.
   > rgb(1,0,1)
   [1] "#FF00FF"
   > rgb(33, 64, 123, max=255)
   [1] "#21407B"
   > rgb(0.3, 0.7, 0.5)
   [1] "#4CB280"
3. cm.colors( ): Creates a vector of n contiguous colors.
   > cm.colors(1)
   [1] "#80FFFFFF"
4. rainbow( ): Creates a vector of n contiguous colors.
   > rainbow(3)
   [1] "#FF0000FF" "#00FF00FF" "#0000FFFF"
5. heat.colors( ): Creates a vector of n contiguous colors.
   > heat.colors(1)
   [1] "#FF0000FF"
6. terrain.colors( ): Creates a vector of n contiguous colors.
   > terrain.colors(2)
   [1] "#00A600FF" "#F2F2F2FF"
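As a short sketch of how these functions fit together in practice, the code below builds a color with rgb(), converts it back with col2rgb(), and uses a palette to color bars. The bar heights and the names my_col and pal are made up for illustration.

```r
# Build a color from 0-255 intensities, then convert it back to its components.
my_col <- rgb(33, 64, 123, maxColorValue = 255)
my_col             # "#21407B"
col2rgb(my_col)    # rows red, green, blue: 33, 64, 123

# A palette function returns one color per requested level.
pal <- heat.colors(5)
length(pal)        # 5

# Use the palette to give each of five bars its own color.
barplot(c(3, 5, 2, 6, 4), col = pal)
```

Any function that accepts a col argument can take either color names, hex codes from rgb(), or a vector produced by a palette function.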
PIE CHART
A pie chart, also known as a circle graph, is a circular graphical view of data. It represents
values as slices (sectors) of a circle in different colors, and the size of each slice shows the
relative frequency or magnitude of its category. The slices are labeled, and the number
corresponding to each slice can also be shown in the chart.
When pie charts for different totals are compared, each circle is drawn with a radius
proportional to the square root of the total it represents, because the area of a circle is given
by πr². The sectors are colored and shaded differently. To construct a pie chart, we draw a
circle with some suitable radius (the square root of the total). The angle of each sector is
calculated as follows:
\[ \text{Angle of each sector} = \frac{\text{Component part}}{\text{Total}} \times 360^\circ \]
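The angle formula can be checked with a few lines of R. This is a small worked example using illustrative slice values; the name angles is an assumption for the sketch.

```r
# Component parts of the whole (illustrative values).
slices <- c(10, 12, 4, 16, 8)

# Angle of each sector = component part / total * 360 degrees.
angles <- slices / sum(slices) * 360
angles         # 72.0 86.4 28.8 115.2 57.6

# The sector angles always add up to a full circle.
sum(angles)    # 360
```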

R uses the function pie() to create pie charts. It takes a vector of non-negative numbers as
input.
pie(x, labels, radius, main, col, clockwise)
Following is the description of the parameters used:
• x is a vector containing the numeric values used in the pie chart.
• labels is used to give descriptions to the slices.
• radius indicates the radius of the circle of the pie chart (a value between −1 and +1).
• main indicates the title of the chart.
• col indicates the color palette.
• clockwise is a logical value indicating whether the slices are drawn clockwise or anticlockwise.
Creating a simple pie chart: To create a simple R pie chart
✓ pass the data vector to pie() with the parameters above;
✓ describe the slices by giving simple labels.
Ex1: # Creating data for the graph
> slices <- c(10, 12, 4, 16, 8)
> lbls <- c("US", "UK", "Australia", "Germany", "France")
# Plot the chart
> pie(slices, labels = lbls, main="Pie Chart of Countries")
Ex2: # Creating data for the graph
> slices <- c(10, 12, 4, 16, 8)
> lbls <- c("US", "UK", "Australia", "Germany", "France")
# Plot the chart with the slices drawn clockwise
> pie(slices, labels = lbls, main="Pie Chart of Countries", clockwise = TRUE, col=colors())

Pie chart including the title and colors: To create a pie chart with a title and colors:
✓ Pass the parameters required for a pie chart, giving the chart a title and adding labels.
✓ Add further features with more parameters, such as a color for each slice.
Ex: # Creating data for the graph
> slices <- c(10, 12, 4, 16, 8)
> lbls <- c("US", "UK", "Australia", "Germany", "France")
# Plot the chart with title and rainbow color palette
> pie(slices, labels = lbls, main="Pie Chart of Countries", col = rainbow(length(slices)))

Slice Percentage: Slice percentage is one of the properties of the pie chart. We can label the
slices with percentages as well as add legends.
Ex1: Pie Chart with Slice Percentages
# Creating data for the graph
> slices <- c(10, 12, 4, 16, 8)
> lbls <- c("US", "UK", "Australia", "Germany", "France")
> pct <- round(slices/sum(slices)*100)
# Print the percentages
> pct
[1] 20 24 8 32 16
> pie(slices, labels = pct, col=rainbow(length(lbls)), main="Pie Chart of Countries")
Ex2: # Creating data for the graph
> slices <- c(10, 12, 4, 16, 8)
> lbls <- c("US", "UK", "Australia", "Germany", "France")
# Calculate the percentage for each slice
> pct <- round(slices/sum(slices)*100)
# Print the percentages
> pct
[1] 20 24 8 32 16
# Add the percentages to the labels
> lbls <- paste(lbls, pct)
# Print labels
> lbls
[1] "US 20" "UK 24" "Australia 8" "Germany 32" "France 16"
# Add % to the labels
> lbls <- paste(lbls, "%")
# Print labels
> lbls
[1] "US 20 %" "UK 24 %" "Australia 8 %" "Germany 32 %" "France 16 %"
> pie(slices, labels = lbls, main="Pie Chart of Countries", col=rainbow(length(lbls)))
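The two paste() steps can also be collapsed into one. The sketch below is an alternative, not the notes' own method: it uses paste0(), which joins its arguments without a separator, so the percent sign sits directly after the number. The name lbls_pct is illustrative.

```r
slices <- c(10, 12, 4, 16, 8)
lbls <- c("US", "UK", "Australia", "Germany", "France")
pct <- round(slices / sum(slices) * 100)

# paste0() adds no separator, so the single space is written explicitly.
lbls_pct <- paste0(lbls, " ", pct, "%")
lbls_pct
# "US 20%" "UK 24%" "Australia 8%" "Germany 32%" "France 16%"

pie(slices, labels = lbls_pct, col = rainbow(length(slices)),
    main = "Pie Chart of Countries")
```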

Chart Legend: The legend is a side section of the chart that gives a small text description of
each series.
Ex1: Pie Chart with Chart legend
# Creating data for the graph
> slices <- c(10, 12, 4, 16, 8)
> lbls <- c("US", "UK", "Australia", "Germany", "France")
> pct <- round(slices/sum(slices)*100)
# Plot the chart
> pie(slices, labels = pct, main="Pie Chart of Countries", col=rainbow(length(lbls)))
> legend("topleft", c("US", "UK", "Australia", "Germany", "France"), cex = 0.4,
fill = rainbow(length(slices)))

Add pie chart color palettes: Color palettes can be added with the brewer.pal() function of the
RColorBrewer package in R.
Ex: # Install RColorBrewer package if not installed in RStudio
> install.packages("RColorBrewer")
# Get the library
> library(RColorBrewer)
# Create the data for graph
> slices <- c(10, 12, 4, 16, 8)
> lbls <- c("US", "UK", "Australia", "Germany", "France")
# Build a palette with one color per slice and pass it to col
> pal <- brewer.pal(length(slices), "Set2")
> pie(slices, labels = lbls, col = pal)

To modify the line type of the borders of the plot, we can make use of the lty argument:
# Get the library
> library(RColorBrewer)
# Create the data for graph
> slices <- c(10, 12, 4, 16, 8)
> lbls <- c("US", "UK", "Australia", "Germany", "France")
> pal <- brewer.pal(length(slices), "Set2")
> pie(slices, labels = lbls, lty = 2, col = pal)
Add shading lines with the density argument:
# Get the library
> library(RColorBrewer)
# Create the data for graph
> slices <- c(10, 12, 4, 16, 8)
> lbls <- c("US", "UK", "Australia", "Germany", "France")
> pal <- brewer.pal(length(slices), "Set2")
> pie(slices, labels = lbls, density = 50, angle = 45, col = pal)

3D Pie Chart: The plotrix package provides the pie3D( ) function to plot a 3D pie chart.
# Install plotrix package if not installed in RStudio
> install.packages("plotrix")
# Get the library
> library(plotrix)
# Create the data for graph
> slices <- c(10, 12, 4, 16, 8)
> lbls <- c("US", "UK", "Australia", "Germany", "France")
> pie3D(slices,labels=lbls,explode=0.1, main="Pie Chart of Countries")
Advantages
• Displays relative proportions of multiple classes of data.
• The size of the circle can be made proportional to the total quantity it represents.
• Summarizes a large data set in visual form.
• Is visually simpler than other types of graphs.
• Permits a visual check of the reasonableness or accuracy of calculations.
Disadvantages
• Does not easily reveal exact values.
• Many pie charts may be needed to show changes over time.
• Fails to reveal key assumptions, causes, effects, or patterns.
• Can be easily manipulated to yield false impressions.
BAR CHART OR BAR PLOT
• Bar charts are a popular and effective way to represent categorical data visually.
• A bar chart, also known as a bar graph, presents categorical data with rectangular bars whose
heights or lengths are proportional to the values that they represent.
• R uses the function barplot( ) to create bar charts. R can draw both vertical and horizontal bars,
and each bar can be given a different color.
• A bar graph has two axes. One axis lists the categories being compared, and the other carries
the numerical values of the data. Either axis can play either role, and the choice determines the
orientation: if the categories are on the horizontal axis, the bars are vertical, and if the
values are on the horizontal axis, the bars are horizontal.
• Syntax: barplot(H, xlab, ylab, main, names.arg, col)
Following is the description of the parameters used −
✓ H is a vector or matrix containing numeric values used in bar chart.
✓ xlab is the label for x axis.
✓ ylab is the label for y axis.
✓ main is the title of the bar chart.
✓ names.arg is a vector of names appearing under each bar.
✓ col is used to give colors to the bars in the graph.
• Types of Bar Plot: There are four types of bar diagrams, they are
1. Simple Bar plot
2. Multiple Bar plot
3. Sub-divided Bar plot or Component Bar plot
4. Deviation Bars
Simple Bar plot: A simple bar plot is used to compare two or more independent categories, each
of which has a single value. When the values are positive, the bars rise from the horizontal base
line. In order to create a bar chart:
✓ A vector (H <- c(Values…)) is taken which contains the numeric values to be used.
✓ This vector H is plotted using barplot().
Creating a Simple Bar Chart in R:
Ex1: # Create the data for the chart
> A <- c(17, 32, 8, 53, 10)
# Plot the bar chart
> barplot(A, xlab = "X-axis", ylab = "Y-axis", main ="Bar-Chart")

Ex2: # Count the cars in mtcars by number of gears
> counts <- table(mtcars$gear)
> barplot(counts, main="Car Distribution", xlab="Number of Gears",
ylab="Total no. of vehicles")

Density / Bar Texture: To change the bar texture, use the density parameter.
Ex: > A <- c(17, 15, 21, 29, 16, 22)
> B <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun")
> barplot(A, names.arg = B, density = 10)
Creating a Horizontal Bar Chart in R: To create a horizontal bar chart:
1. Take all the parameters which are required to make a simple bar chart.
2. To make it horizontal, add the horiz parameter:
barplot(A, horiz=TRUE)
Ex: # Create the data for the chart
> A <- c(17, 32, 8, 53, 1)
> barplot(A, horiz = TRUE, xlab = "X-axis", ylab = "Y-axis", main ="Horizontal Bar Chart")

Adding Label, Title and Color in the Bar Chart: Labels, a title and colors are properties of the
bar chart which can be set by passing the corresponding arguments.
• To add the title in bar chart.
barplot( A, main = title_name )
• X-axis and Y-axis can be labeled in bar chart. To add the label in bar chart.
barplot( A, xlab= x_label_name, ylab= y_label_name)
• To add the color in bar chart.
barplot( A, col=color_name)
Ex: # Create the data for the chart
> A <- c(17, 15, 21, 29, 16, 22)
> B <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun")
# Plot the bar chart with one color per bar
> barplot(A, names.arg = B, xlab ="Month", ylab ="No. of working days",
col = rainbow(length(A)), main ="Month wise working days")
Add Data Values on the Bar:
Ex: # Create the data for the chart
> A <- c(17, 15, 21, 29, 16, 22)
> B <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun")
# Plot the bar chart once and keep the bar midpoints that barplot() returns
> bp <- barplot(A, names.arg = B, xlab ="Month", ylab ="No. of working days",
col = "steelblue", main ="Month wise working days", cex.main = 1.5,
cex.lab = 1.2, cex.axis = 1.1, ylim = c(0, max(A) * 1.2))
# Add data labels on top of each bar
> text(x = bp, y = A, labels = A, pos = 3, cex = 1.2, col = "black")

GROUPED BAR PLOT OR MULTIPLE BAR PLOT: Multiple bar chart is an extension of
simple bar chart. Grouped bars are used to represent related sets of data. For example, imports and
exports of a country together are shown in multiple bar chart. Each bar in a group is shaded or
colored differently for the sake of distinction.
Ex: > colors = c("green", "orange", "brown")
> months <- c("Mar", "Apr", "May", "Jun", "Jul")
> regions <- c("East", "West", "North")
> Values <- matrix(c(2, 9, 3, 11, 9, 4, 8, 7, 3, 12, 5, 2, 8, 10, 11),
nrow = 3, ncol = 5, byrow = TRUE)
> Values
     [,1] [,2] [,3] [,4] [,5]
[1,]    2    9    3   11    9
[2,]    4    8    7    3   12
[3,]    5    2    8   10   11
> barplot(Values, main = "Total Revenue", names.arg = months, xlab = "Month",
ylab = "Revenue", col = colors, beside = TRUE)
> legend("topleft", regions, cex = 0.7, fill = colors)
SUB-DIVIDED BAR PLOT OR COMPONENT BAR PLOT: This chart consists of bars which
are sub-divided into two or more parts. This type of diagram shows the variation in different
components within each class as well as between different classes. A sub-divided bar plot is also
known as a component bar chart or stacked bar chart.
Ex: > colors = c("green", "orange", "brown")
> months <- c("Mar", "Apr", "May", "Jun", "Jul")
> regions <- c("East", "West", "North")
> Values <- matrix(c(2, 9, 3, 11, 9, 4, 8, 7, 3, 12, 5, 2, 8, 10, 11),
nrow = 3, ncol = 5, byrow = TRUE)
> Values
     [,1] [,2] [,3] [,4] [,5]
[1,]    2    9    3   11    9
[2,]    4    8    7    3   12
[3,]    5    2    8   10   11
> barplot(Values, main = "Total Revenue", names.arg = months, xlab = "Month",
ylab = "Revenue", col = colors)
> legend("topleft", regions, cex = 0.7, fill = colors)
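In a stacked bar plot, the height of each bar is the sum of its column in the matrix. The sketch below computes those totals with colSums() and uses them to leave headroom above the bars so the legend does not overlap them. The name totals and the factor 1.3 are illustrative choices.

```r
colors <- c("green", "orange", "brown")
months <- c("Mar", "Apr", "May", "Jun", "Jul")
regions <- c("East", "West", "North")
Values <- matrix(c(2, 9, 3, 11, 9, 4, 8, 7, 3, 12, 5, 2, 8, 10, 11),
                 nrow = 3, ncol = 5, byrow = TRUE)

# Each stacked bar is as tall as the sum of its column.
totals <- colSums(Values)
totals    # 11 19 18 24 32

# Extend the y-axis beyond the tallest bar to make room for the legend.
barplot(Values, main = "Total Revenue", names.arg = months, xlab = "Month",
        ylab = "Revenue", col = colors, ylim = c(0, max(totals) * 1.3))
legend("topleft", regions, cex = 0.7, fill = colors)
```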

DEVIATION BAR PLOT: A graph displays a deviation relationship when it shows how one
or more sets of quantitative values differ from a reference set of values. The graph does this by
directly expressing the differences between the two sets of values.
Deviation bars are used to represent net quantities (excess or deficit), such as net profit, net
loss, net exports or imports, and swings in voting. Such bars have both positive and negative
values. Positive values lie above the base line and negative values lie below it.
Ex: > cars <- c(12, -4, 56, 2, -12, 45, -25)
> barplot(cars, col="light blue")
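Deviation bars are easier to read when the sign is also encoded in the color. The following sketch colors positive and negative bars differently and draws the base line; the color names and the variable bar_cols are illustrative assumptions.

```r
cars <- c(12, -4, 56, 2, -12, 45, -25)

# Choose a color per bar from the sign of its value.
bar_cols <- ifelse(cars >= 0, "steelblue", "tomato")

barplot(cars, col = bar_cols, main = "Deviation Bars")
# Draw the base line that separates excess from deficit.
abline(h = 0)
```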

Advantages:
• Show each data category in a frequency distribution
• Display relative numbers/proportions of multiple categories
• Summarize a large amount of data in a visual, easily interpretable form
• Make trends easier to highlight than tables do
• Estimates can be made quickly and accurately
• Permit visual guidance on accuracy and reasonableness of calculations
• Accessible to a wide audience
Disadvantages:
• Often require additional explanation
• Fail to expose key assumptions, causes, impacts and patterns
• Can be easily manipulated to give false impressions

BOX PLOT
A boxplot shows how the data in a data set are distributed. Three quartile values divide the data
set into four equal parts. The graph represents the minimum, maximum, median, first quartile and
third quartile of the data set. It is also useful for comparing the distribution of data across
data sets by drawing a boxplot for each of them.
Boxplots are created in R by using the boxplot() function. The syntax for boxplots is as follows.
boxplot(x, data, notch, varwidth, names, main)
Following is the description of the parameters used −
• x is a vector or a formula.
• data is the data frame.
• notch is a logical value. Set it to TRUE to draw a notch.
• varwidth is a logical value. Set it to TRUE to draw the width of each box in proportion to the
square root of the sample size.
• names are the group labels which will be printed under each boxplot.
• main is used to give a title to the graph.

Ex: # Create the data for the chart
> x <- c(7, 3, 2, 4, 8)
# Plot the box plot
> boxplot(x, col="pink")
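The five values that a boxplot draws can be inspected directly. This sketch compares fivenum() with the stats component returned by boxplot(); the name b is illustrative.

```r
x <- c(7, 3, 2, 4, 8)

# Minimum, lower hinge, median, upper hinge, maximum.
fivenum(x)    # 2 3 4 7 8

# boxplot() returns (invisibly) the statistics it used for drawing.
b <- boxplot(x, col = "pink")
b$stats       # one column: lower whisker, lower hinge, median, upper hinge, upper whisker
```

For this small vector there are no outliers, so the whisker ends coincide with the minimum and maximum.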

Create boxplot horizontally: We can draw a boxplot horizontally by setting horizontal to TRUE;
the default value is FALSE.
Ex: # Create the data for the chart
> x <- c(7, 3, 2, 4, 8)
> boxplot(x, col="orange", horizontal = TRUE, border = "blue")
Ex: # Create the data for the chart
> input <- mtcars[,c('mpg', 'cyl')]
> print(head(input))
                   mpg cyl
Mazda RX4         21.0   6
Mazda RX4 Wag     21.0   6
Datsun 710        22.8   4
Hornet 4 Drive    21.4   6
Hornet Sportabout 18.7   8
Valiant           18.1   6
> boxplot(mpg ~ cyl, data = mtcars, xlab ="Number of Cylinders", ylab = "Miles Per Gallon",
main = "Mileage Data", col= c("green", "royalblue", "red"))

• In R, the function boxplot() can also take formulas of the form y ~ x, where y is a numeric
vector which is grouped according to the value of x.
• For example, in the dataset mtcars, the mileage per gallon mpg is grouped according to
the number of cylinders cyl present in the cars.
Boxplot using notch: In R, we can draw a boxplot with a notch around the median. Notches help us
judge how the medians of different groups compare: if the notches of two boxes do not overlap,
the medians are likely to differ. The following example draws a notched boxplot for each of the
groups.
Ex: # Create the data for the chart
> input <- mtcars[,c('mpg','cyl')]
> print(head(input))
                   mpg cyl
Mazda RX4         21.0   6
Mazda RX4 Wag     21.0   6
Datsun 710        22.8   4
Hornet 4 Drive    21.4   6
Hornet Sportabout 18.7   8
Valiant           18.1   6
> boxplot(mpg ~ cyl, data = mtcars, xlab = "Number of Cylinders",
ylab = "Miles Per Gallon", main = "Mileage Data",notch = TRUE, varwidth = TRUE,
col = c("green","yellow","purple"),names = c("High", "Medium", "Low"))
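The extent of a notch can be computed by hand. The sketch below assumes the rule documented for boxplot.stats(), namely median ± 1.58 × (hinge spread) / sqrt(n), and checks it against the conf component that R reports. The names mpg4, s and notch are illustrative.

```r
# Mileage of the four-cylinder cars.
mpg4 <- mtcars$mpg[mtcars$cyl == 4]

# Five-number summary: s[2] and s[4] are the hinges, s[3] is the median.
s <- fivenum(mpg4)

# Notch limits: median +/- 1.58 * hinge spread / sqrt(n).
notch <- s[3] + c(-1.58, 1.58) * (s[4] - s[2]) / sqrt(length(mpg4))
notch

# The same limits as reported by R.
boxplot.stats(mpg4)$conf
```

When a group is small, the notch can extend past the box itself, which is why R may warn that some notches went outside the hinges.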

Violin Plots: R provides an additional plotting scheme which is created with the combination of
a boxplot and a kernel density plot. The violin plots are created with the help of vioplot() function
present in the vioplot package.
Ex: # Install vioplot package if not available in system.
> install.packages("vioplot")
# Loading the vioplot package
> library(vioplot)
# Creating data for vioplot function
> x1 <- mtcars$mpg[mtcars$cyl==4]
> x2 <- mtcars$mpg[mtcars$cyl==6]
> x3 <- mtcars$mpg[mtcars$cyl==8]
# Creating vioplot function
> vioplot(x1, x2, x3, names=c("4 cyl", "6 cyl", "8 cyl"), col="green")
Bagplot 2-Dimensional: The bagplot(x, y) function in the aplpack package provides a bivariate
version of the univariate boxplot. The bag contains 50% of all points. The bivariate median is
approximated. The fence separates the points inside it from the points outside, and the outliers
are displayed.

Ex: # Install aplpack package if not available in system.
> install.packages("aplpack")
# Loading the aplpack package
> library(aplpack)
# Creating the bagplot from the mtcars columns wt and mpg
> bagplot(mtcars$wt, mtcars$mpg, xlab="Car Weight", ylab="Miles Per Gallon",
main="2D Boxplot Extension")
Advantages:
• A box plot is a good way to summarize large amounts of data.
• It displays the range and distribution of data along a number line.
• Box plots provide some indication of the data’s symmetry and skewness.
• Box plots show outliers.
Disadvantages
• Original data is not clearly shown in the box plot and also, mean and mode cannot be
identified in a box plot.
• Exact values not retained.
HISTOGRAM
A histogram displays statistical information using rectangles whose area is proportional to the
frequency of a variable and whose width spans successive numerical intervals. It is a graphical
representation that groups data points into specified ranges.
A histogram is a type of bar chart which shows the number of values that fall within each of a
set of value ranges. A histogram is similar to a bar chart, but the difference is that a histogram
is used to show a distribution, whereas a bar chart is used to compare different entities.
In the histogram, the height of each bar represents the number of values present in the given
range. Unlike a bar chart, it has no gaps between the bars, and it resembles a vertical bar graph.
For creating a histogram, R provides the hist() function, which takes a vector as input and
accepts further parameters for more functionality. The syntax of the hist() function is:
hist(v, main, xlab, ylab, xlim, ylim, breaks, col, border)
Following is the description of the parameters used −
• v is a vector containing numeric values used in histogram.
• main indicates title of the chart.
• col is used to set color of the bars.
• border is used to set border color of each bar.
• xlab is used to give a description of the x-axis.
• ylab is used to give a description of the y-axis.
• xlim is used to specify the range of values on the x-axis.
• ylim is used to specify the range of values on the y-axis.
• breaks is used to specify the breakpoints between histogram cells.
hist() also returns an object whose components include:
• counts: the count of values in each range.
• mids: the center point of each cell.
• density: the density of each cell.
Creating a simple Histogram in R: A simple histogram is created by using the above
parameters. The vector v is plotted using hist().
Ex1: # Creating data for the graph.
> v <- c(12, 24, 16, 38, 21, 13, 55, 17, 39, 10, 60)
# Creating the histogram.
> h1=hist(v, xlab = "Weight", ylab="Frequency", col = "green", border = "red")

# Print return value of hist()
> h1
$breaks
[1] 10 20 30 40 50 60
$counts
[1] 5 2 2 0 2
$density
[1] 0.04545455 0.01818182 0.01818182 0.00000000 0.01818182
$mids
[1] 15 25 35 45 55
$xname
[1] "v"
$equidist
[1] TRUE
attr(,"class")
[1] "histogram"
Ex2: # Creating data for the graph.
> v <- c(9, 13, 21, 8, 36, 22, 12, 41, 31, 33, 19)
# Creating the histogram.
> h <- hist(v, xlab = "Weight", col = "pink", border = "blue")

# Print return value of hist()
> h
$breaks
[1] 5 10 15 20 25 30 35 40 45
$counts
[1] 2 2 1 2 0 2 1 1
$density
[1] 0.03636364 0.03636364 0.01818182 0.03636364 0.00000000 0.03636364
[7] 0.01818182 0.01818182
$mids
[1] 7.5 12.5 17.5 22.5 27.5 32.5 37.5 42.5
$xname
[1] "v"
$equidist
[1] TRUE
attr(,"class")
[1] "histogram"

Use of xlim & ylim parameter: To specify the range of values allowed in X axis and Y axis, we
can use the xlim and ylim parameters. The width of each of the bar can be decided by using breaks.
Ex: # Creating data for the graph.
> v <- c(9,13,31,8,31,22,12,31,35)
# Creating the histogram.
> hist(v, xlab = "Weight", col = "light green", border = "red", xlim = c(0, 40),
ylim = c(0, 5), breaks = 5)

Using histogram return values for labels using text(): The following example uses the
function text(), which adds text to a plot, together with the values returned by hist() to place
the count above each cell.
Ex1: # Creating data for the graph.
> v <- c(19, 23, 11, 5, 16, 21, 32, 14, 19, 27, 39, 120, 40, 70, 90)
# Creating the histogram.
> m<-hist(v, xlab = "Weight", ylab ="Frequency", col = "darkmagenta",
border = "pink", breaks = 5)
> text(m$mids, m$counts, labels = m$counts, adj = c(0.5, -0.5))

Ex2: > h<-hist(mtcars$mpg, breaks=12, col="skyblue4")
> text( h$mids, h$counts, labels = h$counts, adj=c(0.5, -0.5))
Histogram using non-uniform width: When breaks is a single number, it gives the number of cells
wanted in the histogram; R treats it as a suggestion and chooses convenient breakpoints. When
breaks is a vector, it gives the breakpoints themselves, so the cells can have non-uniform widths.
Ex1: # Creating data for the graph.
> x<- c(5, 3, 5, 7, 3, 6, 5)
# Creating the histogram.
> hist(x, breaks = 4, col="violetred3", main="breaks=4" )

# Creating the histogram.
> hist(x, breaks = 10, col="slateblue3", main="breaks=10" )
Ex2: # Creating data for the graph.
> v <- c(12, 24, 16, 38, 21, 13, 55, 17, 39, 10, 60, 120, 40, 70, 90)
# Creating the histogram with unequal cell widths; R then plots densities rather than counts.
> hist(v, xlab = "Weight", ylab="Density", xlim=c(50,100), col = "darkmagenta",
border = "pink", breaks=c(10, 55, 60, 70, 75, 80, 100, 120))
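With unequal cell widths, bar height no longer equals the count, so it helps to look at the returned object. The following sketch computes the histogram without drawing it and checks that the bar areas (density times width) add up to 1. The names h and widths are illustrative.

```r
v <- c(12, 24, 16, 38, 21, 13, 55, 17, 39, 10, 60, 120, 40, 70, 90)

# Compute the histogram object without drawing it.
h <- hist(v, breaks = c(10, 55, 60, 70, 75, 80, 100, 120), plot = FALSE)

h$equidist    # FALSE: the cells do not all have the same width
h$counts      # 11 1 1 0 0 1 1

# On the density scale, area (height * width) is proportional to the count.
widths <- diff(h$breaks)
sum(h$density * widths)    # 1

# Drawing the object uses the density scale because the widths differ.
plot(h, col = "darkmagenta", border = "pink", xlab = "Weight")
```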

Kernel Density Plots: Kernel density plots are usually a much more effective way to view the
distribution of a variable. Create the plot using plot(density(x)), where x is a numeric vector.
Ex1: # Kernel Density Plot
> v <- c(9,13,21,8,36,22,12,41,31,33,19)
> d <- density(v)
> d
Call:
density.default(x = v)
Data: v (11 obs.); Bandwidth 'bw' = 6.387
       x                 y
Min.    :-11.16   Min.    :6.854e-05
1st Qu. :  6.67   1st Qu. :2.653e-03
Median  : 24.50   Median  :1.531e-02
Mean    : 24.50   Mean    :1.399e-02
3rd Qu. : 42.33   3rd Qu. :2.281e-02
Max.    : 60.16   Max.    :2.907e-02
> plot(d)
Ex2: # Filled Density Plot
> v <- c(9,13,21,8,36,22,12,41,31,33,19)
> d <- density(v)
> plot(d, main="Kernel Density of v")
> polygon(d, col="steelblue2", border="tomato1")
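A density curve and a histogram can share one graph when the histogram is drawn on the density scale. This is a minimal sketch of that overlay; the axis limits, colors and the name area are illustrative choices, and the area calculation is only a rough numerical approximation.

```r
v <- c(9, 13, 21, 8, 36, 22, 12, 41, 31, 33, 19)
d <- density(v)

# freq = FALSE puts the histogram on the density scale,
# so the bars and the curve are directly comparable.
hist(v, freq = FALSE, col = "lightgray", xlab = "v",
     xlim = range(d$x), ylim = c(0, max(d$y) * 1.5),
     main = "Histogram with Kernel Density")
lines(d, col = "tomato1", lwd = 2)

# The area under a density curve is approximately 1.
area <- sum(d$y) * diff(d$x[1:2])
area
```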

Advantages:
• Visually strong.
• Can be compared to a normal curve.
• The vertical axis is usually a frequency count of the items falling into each category.
Disadvantages:
• Exact values cannot be read because the data is grouped into categories.
• More difficult to compare two data sets.
• Used only with continuous data.
LINE GRAPH:
A line graph, also referred to as a line chart, is a pictorial representation of information
which changes continuously over time. It connects a series of points by drawing line segments
between them, and the lines move up and down with the data to show the continuous change. The
points are ordered by one of their coordinates (usually the x-coordinate). We can use a line graph
to compare different events, information, and situations. Line charts are usually used to
identify trends in data.
To draw a line graph, R provides the plot() function, which has the following syntax:
plot(v, type, col, xlab, ylab, main)
Following is the description of the parameters used:
• v is a vector which contains the numeric values.
• type is a parameter that takes one of the following values:
✓ p: This value is used to draw only the points.
✓ l: This value is used to draw only the lines.
✓ o: This value is used to draw both points and lines.
• xlab is the label for the x-axis.
• ylab is the label for the y-axis.
• main is the title of the chart.
• col is used to give the color for both the points and lines.
Creating a Simple Line Graph: A simple line graph is created by passing the input vector and
the type parameter to plot(). The three calls below show the effect of each type value.
Ex: # Creating the data for the chart.
> v <- c(13, 22, 28, 7, 31)
# Plotting only the points.
> plot(v, type = "p")

# Plotting only the lines.
> plot(v, type = "l")

# Plotting both points and lines.
> plot(v, type = "o")
Adding Title, Color and Labels in Line Graphs: Additional parameters give the chart a title,
label the axes, and set the color of the points and lines.
Ex: # Create the data for the chart.
> v <- c(17, 25, 38, 13, 41)
# Plot the line chart.
> plot(v, type = "o", col = "green", xlab = "Month", ylab = "Article Written",
main = "Article Written chart")

MULTIPLE LINES IN LINE GRAPH
R allows us to draw several lines on the same line graph. The lines() and matplot() functions
are used for this purpose.
• The lines() function takes an additional input vector and adds it as a line to an existing
plot. More than one line can be added to the same chart by calling lines() repeatedly.
Ex: # Create the data for the chart.
> v <- c(17, 25, 38, 13, 41)
> t <- c(22, 19, 36, 19, 23)
> m <- c(25, 14, 16, 34, 29)
> plot(v, type = "o", col = "red", xlab = "Month", ylab = "Article Written",
main = "Article Written chart")
> lines(t, type = "o", col = "blue")
> lines(m, type = "o", col = "green")
# Add a legend
> legend("topleft", legend = c("Line 1", "Line 2", "Line 3"),
col = c("red", "blue", "green"), lty = 1)
Explanation:
✓ We create three sample vectors (v, t and m) of equal length.
✓ The plot() function draws the first line (v) and sets up the axis labels, the title and
the axis limits.
✓ The lines() function adds each remaining line to the existing plot. The type = "o"
argument draws both points and lines, and the col argument sets the color of each
line.
✓ Finally, the legend() function is used to add a legend to the plot.

• The matplot() function is a convenient way to plot multiple lines in one chart when you
have a dataset in a wide format. Here’s an example:
Ex: # Create sample data
> x <- 1:10
> y1 <- c(1, 4, 3, 6, 5, 8, 7, 9, 10, 2)
> y2 <- c(2, 5, 4, 7, 6, 9, 8, 10, 3, 1)
> y3 <- c(3, 6, 5, 8, 7, 10, 9, 2, 4, 1)
# Plot multiple lines using matplot
> matplot(x, cbind(y1, y2, y3), type = "l", lty = 1, col = c("red", "blue", "green"),
xlab = "X", ylab = "Y", main = "Multiple Lines Plot")
# Add a legend
> legend("topleft", legend = c("Line 1", "Line 2", "Line 3"),
col = c("red", "blue", "green"), lty = 1)

Explanation:
✓ We first create sample data for the x-axis (x) and three lines (y1, y2, y3).
✓ The matplot() function is then used to plot the lines. We pass the x-axis values (x)
and a matrix of y-axis values (cbind(y1, y2, y3)) as input.
✓ The type = "l" argument specifies that we want to plot lines.
✓ The lty = 1 argument sets the line type to solid.
✓ The col argument specifies the colors of the lines.
✓ The xlab, ylab, and main arguments set the labels for the x-axis, y-axis, and the
main title of the plot, respectively.
✓ Finally, the legend() function is used to add a legend to the plot, indicating the
colors and labels of the lines.
Advantages:
• It is beneficial for showing changes and trends over different time periods.
• It is also helpful to show small changes that are difficult to measure in other graphs.
• Line graphs are common and effective charts because they are simple, easy to understand, and
efficient.
• It is useful to highlight anomalies within and across data series.
• More than one line may be plotted on the same axis as a form of comparison.

Disadvantages:
• Plotting too many lines over the graph makes it cluttered and confusing to read.
• A wide range of data is challenging to plot over a line graph.
• They are only ideal for representing data that have numerical values and total figures such as
values of total rainfall in a month.
• If consistent scales on the axis aren't used, it might lead to the data of a line graph appearing
inaccurate.
• Also, a line graph is inconvenient if you have to plot fractions or decimal numbers.

SCATTER PLOT
A scatter plot is a set of points, each representing an individual observation positioned by its
values on the horizontal and vertical axes. When the values of two variables are plotted along the
X-axis and Y-axis, the pattern of the resulting points reveals any correlation between them.
Scatter plots are used to compare variables, for example when we need to determine how much
one variable is affected by another. Each point on the scatterplot defines the values of the two
variables: one variable is selected for the vertical axis and the other for the horizontal axis.
In R, there are two ways of creating a scatterplot: the plot() function and the functions of the
ggplot2 package. The syntax for creating a scatterplot with plot() is:
plot(x, y, main, xlab, ylab, xlim, ylim, axes)
Following is the description of the parameters used:
• x is the data set whose values are the horizontal coordinates.
• y is the data set whose values are the vertical coordinates.
• main is the title of the graph.
• xlab is the label on the horizontal axis.
• ylab is the label on the vertical axis.
• xlim gives the limits of the x values used for plotting.
• ylim gives the limits of the y values used for plotting.
• axes indicates whether both axes should be drawn on the plot.
Create a simple scatterplot chart: To create a scatterplot, we pass two numeric vectors of equal
length (or two columns of a data set) to plot().
Ex1: # create a dataset for Xaxis and Yaxis.
> x <- c(5,7,8,7,2,2,9,4,11,12,9,6)
> y <- c(99,86,87,88,111,103,87,94,78,77,85,86)
# Plotting the chart
> plot(x, y)

Ex2: # Fetching two columns from mtcars.
> data <- mtcars[, c('wt', 'mpg')]
# Plotting the chart for cars with weight between 2.5 and 5 and mileage between 15 and 30.
> plot(x = data$wt, y = data$mpg, xlab = "Weight", ylab = "Mileage", xlim = c(2.5, 5),
ylim = c(15, 30), main = "Weight v/s Mileage")

Compare Plots: To compare one set of observations with another on the same plot, use the
points() function.
Ex: # day one, the age and speed of 12 cars
> x1 <- c(5,7,8,7,2,2,9,4,11,12,9,6)
> y1 <- c(99,86,87,88,111,103,87,94,78,77,85,86)
# day two, the age and speed of 15 cars
> x2 <- c(2,2,8,1,15,8,12,9,7,3,11,4,7,14,12)
> y2 <- c(100,105,84,105,90,99,90,95,94,100,79,112,91,80,85)
> plot(x1, y1, main="Observation of Cars", xlab="Car age", ylab="Car speed",
col="red", cex=2)
> points(x2, y2, col="blue", cex=2)
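Note that plot() chooses the axis limits from the first data set only, so the day-two cars aged 14 and 15 years fall outside the plotting region in the example above. A possible remedy, sketched below, is to compute limits that cover both days with range(); the names xr and yr and the legend labels are illustrative.

```r
# Day one and day two observations, as in the example above.
x1 <- c(5, 7, 8, 7, 2, 2, 9, 4, 11, 12, 9, 6)
y1 <- c(99, 86, 87, 88, 111, 103, 87, 94, 78, 77, 85, 86)
x2 <- c(2, 2, 8, 1, 15, 8, 12, 9, 7, 3, 11, 4, 7, 14, 12)
y2 <- c(100, 105, 84, 105, 90, 99, 90, 95, 94, 100, 79, 112, 91, 80, 85)

# Limits that cover both days, so no point is cut off.
xr <- range(c(x1, x2))
yr <- range(c(y1, y2))

plot(x1, y1, main = "Observation of Cars", xlab = "Car age", ylab = "Car speed",
     col = "red", cex = 2, xlim = xr, ylim = yr)
points(x2, y2, col = "blue", cex = 2)
legend("topright", legend = c("Day one", "Day two"), col = c("red", "blue"), pch = 1)
```

The same idea applies to line graphs drawn with plot() and lines(): pass ylim = range(...) over all vectors to the first plot() call.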
Scatterplot using ggplot2: Another way to create a scatterplot in R is with the ggplot2 package,
which provides the ggplot() and geom_point() functions. The first argument to ggplot() is the
data frame, and the second is the aes() function, in which we map variables to the x-axis and
y-axis.
Ex: # Install ggplot2 package into machine if it is not installed.
> install.packages("ggplot2")
#Loading ggplot2 package
> library(ggplot2)
# Plotting the chart using ggplot() and geom_point() functions.
> ggplot(mtcars, aes(x = drat, y = mpg)) +geom_point()

Scatterplot with groups:


Ex: #Loading ggplot2 package
> library(ggplot2)
# Plotting the chart using ggplot() and geom_point() functions.
#The aes() function inside the geom_point() function controls the color of the group.
> ggplot(mtcars, aes(x = drat, y = mpg)) + geom_point(aes(color = factor(gear)))
Changes in axis:
Ex: #Loading ggplot2 package
> library(ggplot2)
# Plotting the chart using ggplot() and geom_point() functions.
# Both variables are log-transformed inside aes(), which changes the scale of the axes.
> ggplot(mtcars, aes(x = log(mpg), y = log(drat))) + geom_point(aes(color=factor(gear)))

Scatterplot with fitted values:
Ex: #Loading ggplot2 package
> library(ggplot2)
#Creating scatterplot with fitted values.
# An additional function stat_smooth() is used to add the linear regression line.
> ggplot(mtcars, aes(x = log(mpg), y = log(drat))) + geom_point(aes(color =
factor(gear))) + stat_smooth(method = "lm",col = "#C42126",se = FALSE,size = 1)

Adding information to the graph:
Ex: #Loading ggplot2 package
> library(ggplot2)
# Creating the scatterplot with fitted values and storing it in new_graph.
# method = "lm" fits a linear regression; se = FALSE hides the standard-error band.
> new_graph<-ggplot(mtcars, aes(x = log(mpg), y = log(drat))) +geom_point(aes(color =
factor(gear))) + stat_smooth(method = "lm",col = "#C42126",se = FALSE,size = 1)
# Adding a title to the stored graph.
> new_graph+labs(title = "Scatterplot with more information")

Scatterplot Matrices: When we have more than two variables and want to examine the
relationship between each pair of them, we use a scatterplot matrix. The pairs() function is used
to create matrices of scatterplots in R. The syntax for creating scatterplot matrices in R is:
pairs(formula, data)
Following is the description of the parameters used:
• formula: This parameter represents the series of variables used in pairs.
• data: This parameter represents the data set from which the variables will be taken.

Ex: # Plot the scatterplot matrix for 4 variables, giving 12 plots
# (each variable is plotted against the 3 others).
> pairs(~wt + mpg + disp + cyl, data = mtcars, main = "Scatterplot Matrix")
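The patterns seen in a scatterplot matrix can be summarized numerically with a correlation matrix. The sketch below uses the base R function cor() on the same four mtcars columns; the names vars and corr are illustrative.

```r
# The same four variables that were passed to pairs().
vars <- mtcars[, c("wt", "mpg", "disp", "cyl")]

# Pearson correlation between every pair of variables.
corr <- cor(vars)
print(round(corr, 2))
```

Each off-diagonal entry corresponds to one panel of the scatterplot matrix: values near 1 or -1 indicate a strong linear pattern, and values near 0 indicate a weak one.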
REGRESSION
Regression analysis is a group of statistical processes used to determine the relationship
between the variables of a data set, most commonly between a dependent variable and one or
more independent variables. It helps us understand how the dependent (response) variable
changes when one of the independent (predictor) variables changes while the other independent
variables are kept constant. This makes it possible to build a regression model and to forecast
values for a change in one of the independent variables. Based on the type of dependent
variable, the number of independent variables, and the shape of the regression line, four
commonly used regression analysis techniques are Linear Regression, Logistic Regression,
Multinomial Logistic Regression, and Ordinal Logistic Regression.

LINEAR REGRESSION
In linear regression, the predictor and response variables are related through an equation in
which the exponent (power) of both variables is 1. Linear regression is used to predict the value
of an outcome variable y on the basis of one or more input predictor variables x; in other words,
it establishes a linear relationship between the predictor and response variables. Mathematically,
a linear relationship is a straight line when plotted as a graph.
The general mathematical equation for linear regression is:
y = ax + b
Here,
• y is a response variable and x is a predictor variable.
• a and b are constants that are called the coefficients.
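For a single predictor, the least-squares values of the coefficients can be written in closed form: the slope is a = cov(x, y) / var(x) and the intercept is b = mean(y) - a * mean(x). The sketch below checks this against lm(), using the height and weight values from the example further below; it is an illustration of the formulas, not a replacement for lm().

```r
x <- c(141, 134, 178, 156, 108, 116, 119, 143, 162, 130)  # predictor (height)
y <- c(62, 85, 56, 21, 47, 17, 76, 92, 62, 58)            # response (weight)

# Closed-form least-squares estimates for y = a*x + b.
a <- cov(x, y) / var(x)      # slope
b <- mean(y) - a * mean(x)   # intercept
print(c(slope = a, intercept = b))

# lm() reports the same two numbers (intercept first, then slope).
print(coef(lm(y ~ x)))
```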
Steps for establishing the Regression
A simple example of regression is predicting the weight of a person when their height is known.
To predict the weight, we need a relationship between the height and weight of a person.
The steps to create the relationship are:
1. First, we carry out the experiment of gathering a sample of observed values of
height and weight.
2. After that, we create a relationship model using the lm() function of R.
3. Next, we find the coefficients from the model and create the mathematical
equation using these coefficients.
4. We get the summary of the relationship model to understand the error in
prediction; the individual errors are known as residuals.
5. Finally, we use the predict() function to predict the weight of a new person.
lm() function: This function creates the relationship model between the predictor and the response
variable. The basic syntax for the lm() function in linear regression is:
lm(formula, data)
Here,
• formula is a symbolic description of the relationship between x and y, for example y ~ x.
• data is the data frame containing the variables named in the formula.
predict() Function: This function predicts values of the response variable from a fitted model.
We will use it to predict the weight of new persons. The basic syntax for predict() in linear
regression is:
predict(object, newdata)
Here,
• object is the model that we have already created using the lm() function.
• newdata is a data frame that contains the new values for the predictor variable.
Ex: Below is the sample data representing the observations
> # Creating input vector for lm() function
> # The predictor vector.
> x <- c(141, 134, 178, 156, 108, 116, 119, 143, 162, 130)
> # The response vector.
> y <- c(62, 85, 56, 21, 47, 17, 76, 92, 62, 58)
> # Applying the lm() function.
> relationship_model<- lm(y~x)
> #Printing the coefficient
> print(relationship_model)
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
47.50833 0.07276
> # Printing the summary of the model
> print(summary(relationship_model))
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-38.948 -7.390 1.869 15.933 34.087
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 47.50833 55.18118 0.861 0.414
x 0.07276 0.39342 0.185 0.858
Residual standard error: 25.96 on 8 degrees of freedom
Multiple R-squared: 0.004257, Adjusted R-squared: -0.1202
F-statistic: 0.0342 on 1 and 8 DF, p-value: 0.8579
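Individual parts of the summary can also be extracted as numbers instead of being read from the printed output. The sketch below shows one way to do this for the same model; the names s and res are illustrative.

```r
x <- c(141, 134, 178, 156, 108, 116, 119, 143, 162, 130)
y <- c(62, 85, 56, 21, 47, 17, 76, 92, 62, 58)
relationship_model <- lm(y ~ x)
s <- summary(relationship_model)

# Residuals: observed weight minus the weight fitted by the model.
res <- residuals(relationship_model)
print(round(res, 2))

print(s$r.squared)  # share of the variance in y explained by x
print(s$sigma)      # residual standard error
```

For this sample the R-squared value is close to zero and the p-value of x is large, which indicates that height explains very little of the variation in weight in these ten observations.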
Ex: Predict the weight of new persons
> # The predictor vector.
> x <- c(141, 134, 178, 156, 108, 116, 119, 143, 162, 130)
> # The response vector.
> y <- c(62, 85, 56, 21, 47, 17, 76, 92, 62, 58)
> # Apply the lm() function.
> relationship_model<- lm(y~x)
> # Find weight of a person with height 170.
> z <- data.frame(x = 170)
> predict_result<- predict(relationship_model,z)
> print(predict_result)
1
59.87736
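predict() can also handle several new values at once and report an interval around each prediction. The sketch below assumes three illustrative heights (150, 160 and 170) and uses the interval argument of predict() for linear models.

```r
x <- c(141, 134, 178, 156, 108, 116, 119, 143, 162, 130)
y <- c(62, 85, 56, 21, 47, 17, 76, 92, 62, 58)
relationship_model <- lm(y ~ x)

# The column name must match the predictor name used in the formula (x).
new_heights <- data.frame(x = c(150, 160, 170))

# interval = "prediction" adds lower and upper bounds (95% by default).
pred <- predict(relationship_model, newdata = new_heights, interval = "prediction")
print(pred)
```

The result is a matrix with the columns fit, lwr and upr; a wide interval signals that the individual prediction is uncertain.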
Plotting Regression: We can visualize the regression graphically using the plot() function, which
takes the x and y vectors as input along with further optional arguments. The fitted line is then
added to the plot with abline().
Ex: #Creating input vector for lm() function
> x <- c(141, 134, 178, 156, 108, 116, 119, 143, 162, 130)
> y <- c(62, 85, 56, 21, 47, 17, 76, 92, 62, 58)
> # Applying the lm() function.
> relationship_model<- lm(y~x)
> # Plotting the observations: height (predictor) on the x-axis, weight (response) on the y-axis.
> plot(x, y, col = "red", main = "Height and Weight Regression",
cex = 1.3, pch = 16, xlab = "Height in cm", ylab = "Weight in Kg")
> # Adding the fitted regression line to the existing plot.
> abline(relationship_model)

MULTIPLE LINEAR REGRESSION
Multiple regression is an extension of linear regression to the relationship between more than two
variables. In simple linear regression we have one predictor and one response variable, but in
multiple regression we have more than one predictor variable (x1, x2, x3, ...) and one response
variable y.
The general mathematical equation for multiple regression is:
y = b0 + b1x1 + b2x2 + b3x3 + ... + bnxn
Here,
• y is the response variable.
• b0, b1, b2, b3...bn are the coefficients.
• x1, x2, x3...xn are the predictor variables.
We create the regression model using the lm() function in R. The model determines the value of
the coefficients using the input data. Next, we can predict the value of the response variable for a
given set of predictor variables using these coefficients.
lm() Function: This function creates the relationship model between the predictors and the
response variable. The basic syntax for the lm() function in multiple regression is:
lm(y ~ x1 + x2 + x3 + ..., data)
Here,
• formula (y ~ x1 + x2 + x3 + ...) is a symbolic description of the relation between the
response variable and the predictor variables.
• data is the data frame on which the formula will be applied.
Ex:
Input Data: Consider the data set "mtcars" available in the R environment. It gives a comparison
between different car models in terms of miles per gallon ("mpg"), cylinder displacement ("disp"),
horse power ("hp"), weight of the car ("wt") and some more parameters.
The goal of the model is to establish the relationship between "mpg" as the response variable and
"disp", "hp" and "wt" as predictor variables. We create a subset of these variables from the
mtcars data set for this purpose.
> input <- mtcars[,c("mpg","disp","hp","wt")]
> print(head(input))
mpg disp hp wt
Mazda RX4 21.0 160 110 2.620
Mazda RX4 Wag 21.0 160 110 2.875
Datsun 710 22.8 108 93 2.320
Hornet 4 Drive 21.4 258 110 3.215
Hornet Sportabout 18.7 360 175 3.440
Valiant 18.1 225 105 3.460
Creating Relationship Model and finding Coefficient: Now we use the subset created above to
build the relationship model with the lm() function, which takes two parameters, i.e., formula
and data.
# Creating input data.
> input <- mtcars[, c("mpg","disp","hp","wt")]
# Creating the relationship model.
> model <- lm(mpg~wt+disp+hp, data = input)
# Show the model.
> print(model)
Call:
lm(formula = mpg ~ wt + disp + hp, data = input)
Coefficients:
(Intercept) wt disp hp
37.105505 -3.800891 -0.000937 -0.031157
# Get the intercept and the coefficients by name, so that the order of the terms in the
# formula does not matter.
> cat("# # # # The Coefficient Values # # # ","\n")
# # # # The Coefficient Values # # #
> a <- coef(model)[1]
> print(a)
(Intercept)
37.10551
> Xdisp <- coef(model)["disp"]
> Xhp <- coef(model)["hp"]
> Xwt <- coef(model)["wt"]
> print(Xdisp)
disp
-0.0009370091
> print(Xhp)
hp
-0.03115655
> print(Xwt)
wt
-3.800891
Create Equation for Regression Model: Based on the above intercept and coefficient values, we
create the mathematical equation.
Y = a + Xdisp*x1 + Xhp*x2 + Xwt*x3
or
Y = 37.105505 + (-0.000937)*x1 + (-0.031157)*x2 + (-3.800891)*x3
Apply Equation for predicting New Values: We can use the regression equation created above
to predict the mileage when a new set of values for displacement, horse power and weight is
provided. For a car with disp = 221, hp = 102 and wt = 2.91 the predicted mileage is
Y = 37.105505 + (-0.000937)*221 + (-0.031157)*102 + (-3.800891)*2.91 = 22.6598
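The same prediction can be obtained directly from the fitted model with predict(), which avoids rounding the coefficients by hand. The sketch below rebuilds the model and compares predict() with the manual calculation; the names new_car, cf and by_hand are illustrative.

```r
input <- mtcars[, c("mpg", "disp", "hp", "wt")]
model <- lm(mpg ~ wt + disp + hp, data = input)

# New car for which the mileage is to be predicted.
new_car <- data.frame(disp = 221, hp = 102, wt = 2.91)
pred <- predict(model, newdata = new_car)
print(pred)

# The same calculation by hand, using the unrounded coefficients looked up by name.
cf <- coef(model)
by_hand <- cf[["(Intercept)"]] + cf[["disp"]] * 221 + cf[["hp"]] * 102 + cf[["wt"]] * 2.91
print(by_hand)
```

Both values agree, and they should match the hand-computed Y above up to the rounding of the coefficients.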
