IDS Unit-5
R uses the function pie() to create pie charts. It takes a vector of non-negative numbers
as input.
pie(x, labels, radius, main, col, clockwise)
Following is the description of the parameters used
• x is a vector containing the numeric values used in the pie chart.
• labels is used to give description to the slices.
• radius indicates the radius of the circle of the pie chart (a value between −1 and +1).
• main indicates the title of the chart
• col indicates the color palette.
• clockwise is a logical value indicating if the slices are drawn clockwise or anti clockwise.
Creating a simple pie chart: To create a simple R pie chart:
✓ Pass the data vector and any of the above parameters to pie().
✓ Describe the slices by giving simple labels.
Ex: # Creating data for the graph
> slices <- c(10, 12, 4, 16, 8)
> lbls <- c("US", "UK", "Australia", "Germany", "France")
# Plot the chart
> pie(slices, labels = lbls, main="Pie Chart of Countries")
Ex: # Creating data for the graph
> slices <- c(10, 12, 4, 16, 8)
> lbls <- c("US", "UK", "Australia", "Germany", "France")
# Plot the chart
> pie(slices, labels = lbls, main="Pie Chart of Countries", clockwise = TRUE, col=colors())
Pie chart including the title and colors: To create a pie chart with a title and colors:
✓ Take all parameters which are required to make an R pie chart, giving a title to the chart
and adding labels.
✓ Add more features by passing more parameters, such as a color for each slice.
Ex: # Creating data for the graph
> slices <- c(10, 12, 4, 16, 8)
> lbls <- c("US", "UK", "Australia", "Germany", "France")
# Plot the chart with title and rainbow color palette
> pie(slices, labels = lbls, main="Pie Chart of Countries", col = rainbow(length(slices)))
Slice Percentage: Slice percentage is one of the properties of the pie chart. We can label the
slices with percentages as well as add legends.
Ex1: Pie Chart with Slice Percentages
# Creating data for the graph
> slices <- c(10, 12, 4, 16, 8)
> lbls <- c("US", "UK", "Australia", "Germany", "France")
> pct <- round(slices/sum(slices)*100)
# Print percentages
> pct
[1] 20 24 8 32 16
> pie(slices, labels = pct, col=rainbow(length(lbls)), main="Pie Chart of Countries")
Ex2: # Creating data for the graph
> slices <- c(10, 12, 4, 16, 8)
> lbls <- c("US", "UK", "Australia", "Germany", "France")
# Calculate the percentage for each slice
> pct <- round(slices/sum(slices)*100)
# Print percentages
> pct
[1] 20 24 8 32 16
# Add percentages to labels
> lbls <-paste(lbls, pct)
# Print labels
> lbls
[1] "US 20" "UK 24" "Australia 8" "Germany 32" "France 16"
# add % to labels
> lbls<-paste(lbls, "%")
# Print labels
> lbls
[1] "US 20 %" "UK 24 %" "Australia 8 %" "Germany 32 %" "France 16 %"
> pie(slices, labels = lbls, main="Pie Chart of Countries", col=rainbow(length(lbls)))
Chart Legend: The legend is a side section of the chart that gives a small text description of
each series.
Ex1: Pie Chart with Chart legend
# Creating data for the graph
> slices <- c(10, 12, 4, 16, 8)
> lbls <- c("US", "UK", "Australia", "Germany", "France")
> pct <- round(slices/sum(slices)*100)
# Plot the chart
> pie(slices, labels = pct, main="Pie Chart of Countries", col=rainbow(length(lbls)))
> legend("topleft", c("US", "UK", "Australia", "Germany", "France"), cex = 0.4,
fill = rainbow(length(slices)))
Add pie chart color palettes: Color palettes can be added with the help of the brewer.pal()
function of the RColorBrewer package in R.
Ex: # Install RColorBrewer package if not installed in RStudio
> install.packages("RColorBrewer")
# Get the library
> library(RColorBrewer)
# Create the data for graph
> slices <- c(10, 12, 4, 16, 8)
> lbls <- c("US", "UK", "Australia", "Germany", "France")
> pal <- brewer.pal(length(slices), "Set2")
> pie(slices, labels = lbls, col = pal)
Modify the border line type: To modify the line type of the borders of the plot, use the lty argument:
# Get the library
> library(RColorBrewer)
# Create the data for graph
> slices <- c(10, 12, 4, 16, 8)
> lbls <- c("US", "UK", "Australia", "Germany", "France")
> labl<- brewer.pal(length(slices), "Set2")
> pie(slices, labels = lbls, lty = 2, col=rainbow(length(lbls)))
Add shading lines with the density argument:
# Get the library
> library(RColorBrewer)
# Create the data for graph
> slices <- c(10, 12, 4, 16, 8)
> lbls <- c("US", "UK", "Australia", "Germany", "France")
> labl<- brewer.pal(length(slices), "Set2")
> pie(slices, labels = lbls, density = 50, angle = 45, col=rainbow(length(lbls)))
3D Pie Chart: The plotrix package provides the pie3D() function to plot a 3D
pie chart.
# Install plotrix package if not installed in RStudio
> install.packages("plotrix")
# Get the library
> library(plotrix)
# Create the data for graph
> slices <- c(10, 12, 4, 16, 8)
> lbls <- c("US", "UK", "Australia", "Germany", "France")
> pie3D(slices,labels=lbls,explode=0.1, main="Pie Chart of Countries")
Advantages
• Display relative proportions of multiple classes of data.
• The size of the circle can be made proportional to the total quantity it represents.
• Summarize a large data set in visual form.
• Are visually simpler than other types of graphs.
• Permit a visual check of the reasonableness or accuracy of calculations.
Disadvantages
• Do not easily reveal exact values.
• Many pie charts may be needed to show changes over time.
• Fail to reveal key assumptions, causes, effects, or patterns.
• Can be easily manipulated to yield false impressions.
BAR CHART OR BAR PLOT
• Bar charts are a popular and effective way to visually represent categorical data in a
structured manner. R is a powerful programming language for data analysis and
visualization.
• A bar chart, also known as a bar graph, is a pictorial representation of a data set that presents
categorical data as rectangular bars with heights or lengths proportional to the values
that they represent. The data set supplies the numerical value of each category, and that value
sets the length or height of its bar.
• R uses the function barplot() to create bar charts. R can draw both vertical and horizontal bars in
the bar chart, and each of the bars can be given a different color.
• A bar graph is a chart that uses bars to show comparisons between categories of data. A bar graph
will have two axes. One axis will describe the types of categories being compared, and the other
will have numerical values that represent the values of the data. It does not matter which axis is
which, but it will determine what bar graph is shown. If the descriptions are on the horizontal axis,
the bars will be oriented vertically, and if the values are along the horizontal axis, the bars will be
oriented horizontally.
• Syntax: barplot(H, xlab, ylab, main, names.arg, col)
Following is the description of the parameters used −
✓ H is a vector or matrix containing numeric values used in bar chart.
✓ xlab is the label for x axis.
✓ ylab is the label for y axis.
✓ main is the title of the bar chart.
✓ names.arg is a vector of names appearing under each bar.
✓ col is used to give colors to the bars in the graph.
• Types of Bar Plot: There are four types of bar diagrams:
1. Simple Bar plot
2. Multiple Bar plot
3. Sub-divided Bar plot or Component Bar plot
4. Deviation Bars
Simple Bar plot: A simple bar plot is used to compare two or more independent categories. Each
category relates to a single fixed value. The values are positive, so the bars rise from the
horizontal axis. In order to create a bar chart:
✓ A vector (H <- c(Values…)) is taken which contains the numeric values to be used.
✓ This vector H is plotted using barplot().
Creating a Simple Bar Chart in R:
Ex1: # Create the data for the chart
> A <- c(17, 32, 8, 53, 10)
# Plot the bar chart
> barplot(A, xlab = "X-axis", ylab = "Y-axis", main ="Bar-Chart")
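When the data is a raw categorical variable rather than a vector of precomputed heights, the counts per category have to be worked out first. The following is a minimal sketch of one way to do this, assuming the built-in mtcars data set (used later in these notes) with its cyl column as the categorical variable; table() is a base R function.

```r
# Count how many cars fall into each cylinder category
counts <- table(mtcars$cyl)
print(counts)

# barplot() accepts the table directly; the category names become bar labels
barplot(counts, xlab = "Number of cylinders", ylab = "Number of cars",
        main = "Cars per cylinder count", col = "steelblue")
```

The axis labels, title and color are illustrative choices only.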
Density / Bar Texture: To change the bar texture, use the density parameter.
Ex: > A <- c(17, 15, 21, 29, 16, 22)
> B <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun")
> barplot(A, names.arg = B, density = 10)
Creating a Horizontal Bar Chart in R: To create a horizontal bar chart:
1. Take all parameters which are required to make a simple bar chart.
2. To make it horizontal, add the horiz parameter:
barplot(A, horiz=TRUE)
Ex: # Create the data for the chart
> A <- c(17, 32, 8, 53, 1)
> barplot(A, horiz = TRUE, xlab = "X-axis", ylab = "Y-axis",
main = "Horizontal Bar Chart")
Adding Label, Title and Color in the BarChart: Labels, title and colors are properties of
the bar chart which can be set by passing the corresponding arguments to barplot().
• To add the title in bar chart.
barplot( A, main = title_name )
• X-axis and Y-axis can be labeled in bar chart. To add the label in bar chart.
barplot( A, xlab= x_label_name, ylab= y_label_name)
• To add the color in bar chart.
barplot( A, col=color_name)
Ex: # Create the data for the chart
> A <- c(17, 15, 21, 29, 16, 22)
> B <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun")
# Plot the bar chart, with one rainbow color per bar
> barplot(A, names.arg = B, xlab ="Month", ylab ="No. of working days",
col = rainbow(length(A)), main ="Month wise working days")
Add Data Values on the Bar:
Ex: # Create the data for the chart
> A <- c(17, 15, 21, 29, 16, 22)
> B <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun")
# Plot the bar chart once and keep the bar midpoints that barplot() returns.
# ylim leaves head room above the tallest bar for its label.
> bp <- barplot(A, names.arg = B, xlab ="Month", ylab ="No. of working days",
col = "steelblue", main ="Month wise working days", cex.main = 1.5,
cex.lab = 1.2, cex.axis = 1.1, ylim = c(0, max(A) * 1.2))
# Add data labels on top of each bar
> text(x = bp, y = A + 1, labels = A, pos = 3, cex = 1.2, col = "black")
GROUPED BAR PLOT OR MULTIPLE BAR PLOT: Multiple bar chart is an extension of
simple bar chart. Grouped bars are used to represent related sets of data. For example, imports and
exports of a country together are shown in multiple bar chart. Each bar in a group is shaded or
colored differently for the sake of distinction.
Ex: > colors = c("green", "orange", "brown")
> months <- c("Mar", "Apr", "May", "Jun", "Jul")
> regions <- c("East", "West", "North")
> Values <- matrix(c(2, 9, 3, 11, 9, 4, 8, 7, 3, 12, 5, 2, 8, 10, 11),
nrow = 3, ncol = 5, byrow = TRUE)
> Values
     [,1] [,2] [,3] [,4] [,5]
[1,]    2    9    3   11    9
[2,]    4    8    7    3   12
[3,]    5    2    8   10   11
> barplot(Values, main = "Total Revenue", names.arg = months, xlab = "Month",
ylab = "Revenue", col = colors, beside = TRUE)
> legend("topleft", regions, cex = 0.7, fill = colors)
SUB-DIVIDED BAR PLOT OR COMPONENT BAR PLOT: This chart consists of bars which
are sub-divided into two or more parts. This type of diagram shows the variation in different
components within each class as well as between different classes. A sub-divided bar plot is also
known as a component bar chart or stacked chart.
Ex: > colors = c("green", "orange", "brown")
> months <- c("Mar", "Apr", "May", "Jun", "Jul")
> regions <- c("East", "West", "North")
> Values <- matrix(c(2, 9, 3, 11, 9, 4, 8, 7, 3, 12, 5, 2, 8, 10, 11),
nrow = 3, ncol = 5, byrow = TRUE)
> Values
     [,1] [,2] [,3] [,4] [,5]
[1,]    2    9    3   11    9
[2,]    4    8    7    3   12
[3,]    5    2    8   10   11
> barplot(Values, main = "Total Revenue", names.arg = months, xlab = "Month",
ylab = "Revenue", col = colors)
> legend("topleft", regions, cex = 0.7, fill = colors)
DEVIATION BAR PLOT: A graph displays a deviation relationship when it shows how one
or more sets of quantitative values differ from a reference set of values. The graph does this by
directly expressing the differences between two sets of values.
Deviation bars are used to represent net quantities - excess or deficit i.e. net profit, net loss, net
exports or imports, swings in voting etc. Such bars have both positive and negative values. Positive
values lie above the base line and negative values lie below it.
Ex: > cars <- c(12, -4, 56, 2, -12, 45, -25)
> barplot(cars, col="light blue")
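The idea of deviation from a reference value can be sketched as follows. The sales figures are hypothetical values made up for illustration, and the mean of the series is assumed to be the reference; any other baseline could be used instead.

```r
# Hypothetical monthly sales figures (illustrative values only)
sales <- c(Jan = 120, Feb = 95, Mar = 140, Apr = 80, May = 110, Jun = 130)

# Reference value: here, the mean of the series (an assumption for this sketch)
reference <- mean(sales)

# Deviation of each month from the reference; positive means above average
deviation <- sales - reference
print(deviation)

# Color the bars by sign so excess and deficit are easy to tell apart
bar_cols <- ifelse(deviation >= 0, "steelblue", "tomato")
barplot(deviation, col = bar_cols, ylab = "Deviation from mean",
        main = "Monthly sales relative to the average")
abline(h = 0)  # draw the base line
```

Bars above the base line mark months that exceed the reference, and bars below it mark months that fall short.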
Advantages:
• Show each data category in a frequency distribution
• Display relative numbers/proportions of multiple categories
• Summarize a large amount of data in a visual, easily interpretable form
• Make trends easier to highlight than tables do
• Estimates can be made quickly and accurately
• Permit visual guidance on accuracy and reasonableness of calculations
• Accessible to a wide audience
Disadvantages:
• Often require additional explanation
• Fail to expose key assumptions, causes, impacts and patterns
• Can be easily manipulated to give false impressions
BOX PLOT
Boxplots show how the data in a data set is distributed. The three quartiles divide the data set into
four parts. The graph represents the minimum, maximum, median, first quartile and third
quartile of the data set. It is also useful for comparing the distribution of data across data sets by
drawing a boxplot for each of them.
Boxplots are created in R by using the boxplot() function. The syntax for boxplots is as follows.
boxplot(x, data, notch, varwidth, names, main)
Following is the description of the parameters used −
• x is a vector or a formula.
• data is the data frame.
• notch is a logical value. Set as TRUE to draw a notch.
• varwidth is a logical value. Set as TRUE to draw the width of the box proportionate to the sample
size.
• names are the group labels which will be printed under each boxplot.
• main is used to give a title to the graph.
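The summary values that a boxplot draws can also be inspected as numbers. This is a minimal sketch using the base R functions fivenum() and boxplot.stats() on a small illustrative vector; note that these functions report hinges, which are close to but not always identical to the first and third quartiles.

```r
x <- c(7, 3, 2, 4, 8)

# Tukey's five-number summary: minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(x)
print(five)

# boxplot.stats() returns the values that boxplot() draws:
# $stats holds the whisker ends, hinges and median; $out holds any outliers
stats <- boxplot.stats(x)
print(stats$stats)
print(stats$out)
```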
Create boxplot horizontally: We can draw a boxplot horizontally by setting horizontal to TRUE;
the default value is FALSE.
Ex: # Create the data for the chart
> x <- c(7, 3, 2, 4, 8)
> boxplot(x, col="orange", horizontal = TRUE, border = "blue")
Ex: # Create the data for the chart
> input <- mtcars[,c('mpg', 'cyl')]
> print(head(input))
                   mpg cyl
Mazda RX4         21.0   6
Mazda RX4 Wag     21.0   6
Datsun 710        22.8   4
Hornet 4 Drive    21.4   6
Hornet Sportabout 18.7   8
Valiant           18.1   6
> boxplot(mpg ~ cyl, data = mtcars, xlab ="Number of Cylinder", ylab = "Miles Per Gallon",
main = "Mileage Data", col= c("green", "royalblue", "red"))
• In R, the function boxplot() can also take formulas of the form y ~ x, where y is a numeric
vector which is grouped according to the value of x.
• For example, in the mtcars dataset, the mileage per gallon mpg is grouped according to
the number of cylinders cyl present in the cars.
Boxplot using notch: In R, we can draw a boxplot with a notch. The notches help us to see how the
medians of different data groups compare with each other: if the notches of two boxes do not
overlap, this is strong evidence that the two medians differ. The following example creates a
notched boxplot for each of the groups.
Ex: # Create the data for the chart
> input <- mtcars[,c('mpg','cyl')]
> print(head(input))
                   mpg cyl
Mazda RX4         21.0   6
Mazda RX4 Wag     21.0   6
Datsun 710        22.8   4
Hornet 4 Drive    21.4   6
Hornet Sportabout 18.7   8
Valiant           18.1   6
> boxplot(mpg ~ cyl, data = mtcars, xlab = "Number of Cylinders",
ylab = "Miles Per Gallon", main = "Mileage Data",notch = TRUE, varwidth = TRUE,
col = c("green","yellow","purple"),names = c("High", "Medium", "Low"))
Violin Plots: R provides an additional plotting scheme which is created with the combination of
a boxplot and a kernel density plot. The violin plots are created with the help of vioplot() function
present in the vioplot package.
Ex: # Install vioplot package if not available in system.
> install.packages("vioplot")
# Loading the vioplot package
> library(vioplot)
# Creating data for vioplot function
> x1 <- mtcars$mpg[mtcars$cyl==4]
> x2 <- mtcars$mpg[mtcars$cyl==6]
> x3 <- mtcars$mpg[mtcars$cyl==8]
# Creating vioplot function
> vioplot(x1, x2, x3, names=c("4 cyl", "6 cyl", "8 cyl"), col="green")
Bagplot 2-Dimensional: The bagplot(x, y) function in the aplpack package provides a bivariate
version of the univariate boxplot. The bag contains 50% of all points. The bivariate median is
approximated. The fence separates the points inside it from the points outside, and the outliers are displayed.
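A bagplot of two numeric variables might be drawn as sketched below. This assumes the aplpack package has been installed with install.packages("aplpack"); the choice of the wt and mpg columns of mtcars is only an illustration, and the call is skipped when the package is not available.

```r
# Assumes the aplpack package is installed: install.packages("aplpack")
if (requireNamespace("aplpack", quietly = TRUE)) {
  # Car weight against mileage from the built-in mtcars data set
  aplpack::bagplot(mtcars$wt, mtcars$mpg,
                   xlab = "Car weight", ylab = "Miles per gallon",
                   main = "Bagplot of weight and mileage")
} else {
  message("Package 'aplpack' is not installed; skipping the bagplot.")
}
```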
Advantages:
• A box plot is a good way to summarize large amounts of data.
• It displays the range and distribution of data along a number line.
• Box plots provide some indication of the data’s symmetry and skewness.
• Box plots show outliers.
Disadvantages
• Original data is not clearly shown in the box plot and also, mean and mode cannot be
identified in a box plot.
• Exact values not retained.
HISTOGRAM
A histogram is a graphical representation that groups data points into specified ranges. Each
range is drawn as a rectangle whose width is the numerical interval and whose area is proportional
to the frequency of the variable in that interval.
A histogram is a type of bar chart which shows how many values fall into each of a set of value
ranges. A histogram is similar to a bar chart, but the difference is that a histogram is used to show
a distribution, whereas a bar chart is used for comparing different entities.
In a histogram, the height of each bar represents the number of values present in the given range.
Unlike a bar chart, it shows no gaps between the bars; otherwise it is similar to a vertical bar graph.
For creating a histogram, R provides the hist() function, which takes a vector as input and accepts
more parameters to add more functionality. The syntax of the hist() function is:
hist(v, main, xlab, ylab, xlim, ylim, breaks, col, border)
Following is the description of the parameters used −
• v is a vector containing numeric values used in histogram.
• main indicates title of the chart.
• col is used to set color of the bars.
• border is used to set border color of each bar.
• xlab is used to give description of x-axis.
• ylab is used to give description of y-axis.
• xlim is used to specify the range of values on the x-axis.
• ylim is used to specify the range of values on the y-axis.
• breaks is used to mention breakpoints between histogram cells.
The hist() function also returns an object whose components include:
• counts: the count of values in each cell.
• mids: the center point of each cell.
• density: the estimated density of each cell.
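The values that hist() returns can be inspected without drawing anything. This is a minimal sketch on an illustrative vector, using the plot = FALSE argument of hist().

```r
v <- c(12, 24, 16, 38, 21, 13, 55, 17, 39, 10, 60)

# plot = FALSE computes the histogram without drawing it
h <- hist(v, plot = FALSE)

print(h$breaks)   # boundaries of the cells
print(h$counts)   # number of values in each cell
print(h$mids)     # midpoint of each cell
print(h$density)  # counts scaled so that the bar areas sum to 1
```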
Creating a simple Histogram in R: A simple histogram is created by using the above
parameters. The vector v is plotted using hist().
Ex1: # Creating data for the graph.
> v <- c(12, 24, 16, 38, 21, 13, 55, 17, 39, 10, 60)
# Creating the histogram.
> h1=hist(v, xlab = "Weight", ylab="Frequency", col = "green", border = "red")
Use of xlim & ylim parameter: To specify the range of values allowed on the X axis and Y axis, we
can use the xlim and ylim parameters. The width of each bar can be decided by using breaks.
Ex: # Creating data for the graph.
> v <- c(9,13,31,8,31,22,12,31,35)
# Creating the histogram.
> hist(v, xlab = "Weight", col = "light green", border = "red", xlim = c(0, 40),
ylim = c(0, 5), breaks = 5)
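A single number passed to breaks is treated by hist() only as a suggestion for the number of cells. To fix the bar width exactly, a vector of cell boundaries can be passed instead. A minimal sketch with the same data, assuming cells of width 10 are wanted:

```r
v <- c(9, 13, 31, 8, 31, 22, 12, 31, 35)

# Explicit cell boundaries: four cells of width 10 covering 0 to 40
cells <- seq(0, 40, by = 10)

h <- hist(v, breaks = cells, xlab = "Weight", col = "light green",
          border = "red", xlim = c(0, 40), ylim = c(0, 5))

# Number of values that fell into each cell
print(h$counts)
```

The boundaries must cover the whole range of the data, otherwise hist() stops with an error.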
Using histogram return values for labels using text(): The following example uses the
function text(), which adds text to a plot, together with the values returned by hist() to place
the count above each cell.
Ex1: # Creating data for the graph.
> v <- c(19, 23, 11, 5, 16, 21, 32, 14, 19, 27, 39, 120, 40, 70, 90)
# Creating the histogram.
> m<-hist(v, xlab = "Weight", ylab ="Frequency", col = "darkmagenta",
border = "pink", breaks = 5)
> text(m$mids, m$counts, labels = m$counts, adj = c(0.5, -0.5))
Kernel Density Plots: Kernel density plots are usually a much more effective way to view the
distribution of a variable. Create the plot using plot(density(x)), where x is a numeric vector.
Ex1: # Kernel Density Plot
> v <- c(9,13,21,8,36,22,12,41,31,33,19)
> d <- density(v)
> d
Call:
        density.default(x = v)
Data: v (11 obs.);      Bandwidth 'bw' = 6.387
       x                y
 Min.   :-11.16   Min.   :6.854e-05
 1st Qu.:  6.67   1st Qu.:2.653e-03
 Median : 24.50   Median :1.531e-02
 Mean   : 24.50   Mean   :1.399e-02
 3rd Qu.: 42.33   3rd Qu.:2.281e-02
 Max.   : 60.16   Max.   :2.907e-02
> plot(d)
Ex2: # Filled Density Plot
> v <- c(9,13,21,8,36,22,12,41,31,33,19)
> d <- density(v)
> plot(d, main="Kernel Density of v")
> polygon(d, col="steelblue2", border="tomato1")
Advantages:
• Visually strong.
• Can be compared to a normal curve.
• The vertical axis is usually a frequency count of the items falling into each category.
Disadvantages:
• Cannot read exact values because data is grouped in categories.
• More difficult to compare two data sets.
• Use only with continuous data
LINE GRAPH:
A line graph, also referred to as a line chart, is a pictorial representation of information which
changes continuously over time. A line chart connects a series of points by drawing line segments
between them. The points are ordered by one of their coordinates (usually the x-coordinate), and
the lines move up and down with the data to show the continuous change. We can use a line graph to
compare different events, information, and situations. Line charts are usually used for identifying
trends in data.
For developing a line graph, R provides the plot() function, which has the following syntax:
plot(v, type, col, xlab, ylab, main)
Following is the description of the parameters used:
• v is a vector which contains the numeric values.
• type is a parameter which takes one of the following values:
✓ p: This value is used to draw only the points.
✓ l: This value is used to draw only the lines.
✓ o: This value is used to draw both points and lines
• xlab is the label for the x-axis.
• ylab is the label for the y-axis.
• main is the title of the chart.
• col is used to give the color for both the points and lines.
Creating a Simple Line Graph: A simple line graph is created using the input vector and the type
parameter set to "o".
Ex: # Creating the data for the chart.
> v <- c(13, 22, 28, 7, 31)
# Plotting the line chart.
> plot(v, type = "o")
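The three values of the type parameter can be compared side by side. This is a minimal sketch that assumes a one-row, three-panel layout set with par(mfrow = ...); the panel titles are illustrative.

```r
v <- c(13, 22, 28, 7, 31)

# Draw three panels in one row, remembering the previous layout
old_par <- par(mfrow = c(1, 3))
plot(v, type = "p", main = "type = \"p\"")  # points only
plot(v, type = "l", main = "type = \"l\"")  # lines only
plot(v, type = "o", main = "type = \"o\"")  # points joined by lines

# Restore the previous layout so later plots are not affected
par(old_par)
```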
• The matplot() function is a convenient way to plot multiple lines in one chart when you
have a dataset in a wide format. Here’s an example:
Ex: # Create sample data
> x <- 1:10
> y1 <- c(1, 4, 3, 6, 5, 8, 7, 9, 10, 2)
> y2 <- c(2, 5, 4, 7, 6, 9, 8, 10, 3, 1)
> y3 <- c(3, 6, 5, 8, 7, 10, 9, 2, 4, 1)
# Plot multiple lines using matplot
> matplot(x, cbind(y1, y2, y3), type = "l", lty = 1, col = c("red", "blue", "green"),
xlab = "X", ylab = "Y", main = "Multiple Lines Plot")
# Add a legend
> legend("topleft", legend = c("Line 1", "Line 2", "Line 3"),
col = c("red", "blue", "green"), lty = 1)
Explanation:
✓ We first create sample data for the x-axis (x) and three lines (y1, y2, y3).
✓ The matplot() function is then used to plot the lines. We pass the x-axis values (x)
and a matrix of y-axis values (cbind(y1, y2, y3)) as input.
✓ The type = "l" argument specifies that we want to plot lines.
✓ The lty = 1 argument sets the line type to solid.
✓ The col argument specifies the colors of the lines.
✓ The xlab, ylab, and main arguments set the labels for the x-axis, y-axis, and the
main title of the plot, respectively.
✓ Finally, the legend() function is used to add a legend to the plot, indicating the colors
and labels of the lines.
Advantages:
• It is beneficial for showing changes and trends over different time periods.
• It is also helpful to show small changes that are difficult to measure in other graphs.
• Line graphs are common and effective charts because they are simple, easy to understand, and
efficient.
• It is useful to highlight anomalies within and across data series.
• More than one line may be plotted on the same axis as a form of comparison.
Disadvantages:
• Plotting too many lines over the graph makes it cluttered and confusing to read.
• A wide range of data is challenging to plot over a line graph.
• They are only ideal for representing data that have numerical values and total figures such as
values of total rainfall in a month.
• If consistent scales on the axis aren't used, it might lead to the data of a line graph appearing
inaccurate.
• Also, line graph is inconvenient if you have to plot fractions or decimal numbers.
SCATTER PLOT
A scatter plot is a set of dotted points representing individual data pieces on the horizontal and
vertical axis. In a graph in which the values of two variables are plotted along the X-axis and Y-
axis, the pattern of the resulting points reveals a correlation between them.
The scatter plots are used to compare variables. A comparison between variables is required when
we need to define how much one variable is affected by another variable. In a scatterplot, the data
is represented as a collection of points. Each point on the scatterplot defines the values of the two
variables. One variable is selected for the vertical axis and other for the horizontal axis.
In R, there are two ways of creating a scatterplot, i.e., using the plot() function and using the ggplot2
package's functions. The syntax for creating a scatterplot with plot() is:
plot(x, y, main, xlab, ylab, xlim, ylim, axes)
Following is the description of the parameters used:
• x is the dataset whose values are the horizontal coordinates.
• y is the dataset whose values are the vertical coordinates
• main is the title of the graph.
• xlab is the label on the horizontal axis.
• ylab is the label on the vertical axis.
• xlim is the limits of the x values used for plotting.
• ylim is the limits of the y values used for plotting.
• axes indicates whether both axes should be drawn on the plot.
Create a simple scatterplot chart: In order to create a scatterplot chart, we take two numeric
vectors of equal length, one for each axis, and pass them to plot().
Ex1: # create a dataset for Xaxis and Yaxis.
> x <- c(5,7,8,7,2,2,9,4,11,12,9,6)
> y <- c(99,86,87,88,111,103,87,94,78,77,85,86)
# Plotting the chart
> plot(x, y)
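The strength of the relationship that the scatterplot shows can be summarised with a correlation coefficient. This is a minimal sketch using the base R function cor(), which computes the Pearson correlation by default, on the same data; putting the value in the plot title is only an illustrative choice.

```r
x <- c(5, 7, 8, 7, 2, 2, 9, 4, 11, 12, 9, 6)
y <- c(99, 86, 87, 88, 111, 103, 87, 94, 78, 77, 85, 86)

# Pearson correlation coefficient: ranges from -1 to 1
r <- cor(x, y)
print(r)

# Show the rounded coefficient in the title of the scatterplot
plot(x, y, main = paste("Correlation:", round(r, 2)))
```

A negative value means that y tends to fall as x rises, which matches the downward pattern of the points.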
Compare Plots: To compare the plot with another plot, use the points() function
Ex: # day one, the age and speed of 12 cars
> x1 <- c(5,7,8,7,2,2,9,4,11,12,9,6)
> y1 <- c(99,86,87,88,111,103,87,94,78,77,85,86)
# day two, the age and speed of 15 cars
> x2 <- c(2,2,8,1,15,8,12,9,7,3,11,4,7,14,12)
> y2 <- c(100,105,84,105,90,99,90,95,94,100,79,112,91,80,85)
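# xlim and ylim cover both days so that no day-two point is cut off
> plot(x1, y1, main="Observation of Cars", xlab="Car age", ylab="Car speed",
col="red", cex=2, xlim = range(x1, x2), ylim = range(y1, y2))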
> points(x2, y2, col="blue", cex=2)
Scatterplot using ggplot2: In R, there is another way of creating a scatterplot, i.e. with the help of
the ggplot2 package. The ggplot2 package provides the ggplot() and geom_point() functions for
creating a scatterplot. The first argument of ggplot() is the input data frame, and the second is the
aes() function, in which we map columns to the x-axis and y-axis.
Ex: # Install ggplot2 package into machine if it is not installed.
> install.packages("ggplot2")
#Loading ggplot2 package
> library(ggplot2)
# Plotting the chart using ggplot() and geom_point() functions.
> ggplot(mtcars, aes(x = drat, y = mpg)) +geom_point()
Scatterplot Matrices: When we have more than two variables and we want to see the correlation
between each variable and the others, we use a scatterplot matrix. The pairs() function is used to
create matrices of scatterplots. The syntax for creating scatterplot matrices in R is:
pairs(formula, data)
Following is the description of the parameters used:
• formula: This parameter represents the series of variables used in pairs.
• data: This parameter represents the data set from which the variables will be taken.
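A call to pairs() might look like the sketch below. The four columns of the built-in mtcars data set are chosen only for illustration; the correlation matrix printed afterwards is an optional numeric companion to the plot.

```r
# Scatterplot matrix of four numeric columns of the built-in mtcars data set
pairs(~ mpg + disp + drat + wt, data = mtcars,
      main = "Scatterplot Matrix")

# The matching correlation matrix, for reading alongside the plot
cor_matrix <- cor(mtcars[, c("mpg", "disp", "drat", "wt")])
print(round(cor_matrix, 2))
```

Each panel of the matrix plots one pair of variables, so the panel in row mpg and column wt corresponds to the mpg and wt entry of the correlation matrix.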
LINEAR REGRESSION
In linear regression, the predictor and response variables are related through an equation in which
the exponent (power) of both these variables is 1. Linear regression is used to predict the value of an
outcome variable y on the basis of one or more input predictor variables x. In other words, linear
regression is used to establish a linear relationship between the predictor and response variables.
Mathematically, a linear relationship denotes a straight line when plotted as a graph.
The general mathematical equation for linear regression is:
y = ax + b
Here,
• y is a response variable and x is a predictor variable.
• a and b are constants that are called the coefficients.
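For a single predictor, the coefficients a and b can be computed directly from the data with the standard least-squares formulas. The sketch below uses the same sample data as the regression example further down in these notes and compares the hand-computed values with those of lm(); the variable names a, b and fit are chosen only for this illustration.

```r
x <- c(141, 134, 178, 156, 108, 116, 119, 143, 162, 130)  # predictor
y <- c(62, 85, 56, 21, 47, 17, 76, 92, 62, 58)            # response

# Least-squares estimates for y = a*x + b
a <- cov(x, y) / var(x)      # slope
b <- mean(y) - a * mean(x)   # intercept

# Compare with the coefficients that lm() reports
fit <- lm(y ~ x)
print(c(intercept = b, slope = a))
print(coef(fit))
```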
Steps for establishing the Regression
A simple example of regression is predicting a weight of a person when his height is known. To
predict the weight, we need to have a relationship between the height and weight of a person.
There are the following steps to create the relationship:
1. In the first step, we carry out the experiment of gathering a sample of observed values of
height and weight.
2. After that, we create a relationship model using the lm() function of R.
3. Next, we will find the coefficient with the help of the model and create the mathematical
equation using this coefficient.
4. We will get the summary of the relationship model to understand the average error in
prediction, known as residuals.
5. At last, we use the predict() function to predict the weight of the new person.
lm() function: This function creates the relationship model between the predictor and the response
variable. The basic syntax for lm() function in linear regression is:
lm(formula, data)
Here,
• formula is a symbolic description of the relationship between x and y, for example y ~ x.
• data is the data frame that contains the variables named in the formula. It can be omitted
when the variables are plain vectors in the workspace.
predict() Function: This function predicts response values for new predictor values from a fitted
model. Now, we will predict the weight of new persons with the help of the predict() function. The
basic syntax for predict() in linear regression is:
predict(object, newdata)
Here,
• object is the model that we have already created using the lm() function.
• newdata is the data frame that contains the new values for the predictor variable.
Ex: Below is the sample data representing the observations
> # Creating input vector for lm() function
> # The predictor vector.
> x <- c(141, 134, 178, 156, 108, 116, 119, 143, 162, 130)
> # The response vector.
> y <- c(62, 85, 56, 21, 47, 17, 76, 92, 62, 58)
> # Applying the lm() function.
> relationship_model<- lm(y~x)
> # Printing the coefficients
> print(relationship_model)
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept)            x
   47.50833      0.07276
> # Printing the summary
> print(summary(relationship_model))
Call:
lm(formula = y ~ x)
Residuals:
    Min      1Q  Median      3Q     Max
-38.948  -7.390   1.869  15.933  34.087
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 47.50833   55.18118   0.861    0.414
x            0.07276    0.39342   0.185    0.858
Residual standard error: 25.96 on 8 degrees of freedom
Multiple R-squared: 0.004257, Adjusted R-squared: -0.1202
F-statistic: 0.0342 on 1 and 8 DF, p-value: 0.8579
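Individual pieces of the model and its summary can also be extracted as values instead of being read off the printed output. This is a minimal sketch with the same data, using the base R accessors coef() and residuals() and the r.squared and sigma components of the summary object; the name model_summary is chosen only for this illustration.

```r
x <- c(141, 134, 178, 156, 108, 116, 119, 143, 162, 130)
y <- c(62, 85, 56, 21, 47, 17, 76, 92, 62, 58)

relationship_model <- lm(y ~ x)
model_summary <- summary(relationship_model)

print(coef(relationship_model))       # intercept and slope
print(residuals(relationship_model))  # observed minus fitted values
print(model_summary$r.squared)        # proportion of variance explained
print(model_summary$sigma)            # residual standard error
```

A small R-squared value, as in this sample, indicates that height explains very little of the variation in weight in these observations.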
Ex: Predict the weight of new persons
> # The predictor vector.
> x <- c(141, 134, 178, 156, 108, 116, 119, 143, 162, 130)
> # The response vector.
> y <- c(62, 85, 56, 21, 47, 17, 76, 92, 62, 58)
> # Apply the lm() function.
> relationship_model<- lm(y~x)
> # Find weight of a person with height 170.
> z <- data.frame(x = 170)
> predict_result<- predict(relationship_model,z)
> print(predict_result)
       1
59.87736
Plotting Regression: The regression can be visualized graphically using the plot() function. This
function takes the parameters x and y as input vectors, along with many more arguments. The fitted
regression line can be added to the plot with abline().
Ex: #Creating input vector for lm() function
> x <- c(141, 134, 178, 156, 108, 116, 119, 143, 162, 130)
> y <- c(62, 85, 56, 21, 47, 17, 76, 92, 62, 58)
> # Applying the lm() function.
> relationship_model<- lm(y~x)
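> # Plotting the chart: height (x) on the horizontal axis, weight (y) on the vertical axis.
> plot(x, y, col = "red", main = "Height and Weight Regression",
cex = 1.3, pch = 16, xlab = "Height in cm", ylab = "Weight in Kg")
> # Adding the fitted regression line.
> abline(relationship_model)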