
On Eda

Uploaded by

Neeraja Bhukya
Copyright
© © All Rights Reserved
Exploratory data analysis

Exploratory Data Analysis (EDA) is an
approach to analyzing data using visual
techniques. It is used to discover trends and
patterns, or to check assumptions, with the
help of statistical summaries and graphical
representations.
R Line Charts
A line chart is a graph that connects a series
of points by drawing line segments between
them. The points are ordered by one of their
coordinates (usually the x-coordinate).
The plot() function in R is used to create the
line graph.
Syntax
The basic syntax to create a line chart in R is −
plot(v,type,col,xlab,ylab,main)
v is a vector containing the numeric values.
type takes the value "p" to draw only the points, "l"
to draw only the lines and "o" to draw both points
and lines.
xlab is the label for x axis.
ylab is the label for y axis.
main is the Title of the chart.
col is used to give colors to both the points and lines.
Example
A simple line chart is created using the input
vector and the type parameter set to "o".
# Create the data for the chart.
v <- c(7,12,28,3,41)
# Plot the line chart.
plot(v, type = "o")
Line Chart Title, Color and Labels
The features of the line chart can be expanded by
using additional parameters. We add color to the
points and lines, give a title to the chart and add
labels to the axes.
# Create the data for the chart.
v <- c(7,12,28,3,41)
# Plot the line chart with color, axis labels and a title.
plot(v, type = "o", col = "red", xlab = "Month",
     ylab = "Rain fall", main = "Rain fall chart")
Multiple Lines in a Line Chart

More than one line can be drawn on the same chart
by using the lines() function.
After the first line is plotted, the lines() function can
use an additional vector as input to draw the second
line in the chart.
# Create the data for the chart.
v <- c(7,12,28,3,41)
t <- c(14,7,6,19,3)
# Plot the first line.
plot(v, type = "o", col = "red", xlab = "Month",
     ylab = "Rain fall", main = "Rain fall chart")
# Add the second line to the same chart.
lines(t, type = "o", col = "blue")
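One detail worth knowing: lines() draws onto the existing plot and does not rescale the axes, so values in the second vector that fall outside the range of the first would be cut off. A minimal sketch, assuming the same two vectors as in the example, sets ylim from both series before plotting. The name y_range is only illustrative.

```r
# Rainfall vectors from the example.
v <- c(7, 12, 28, 3, 41)
t <- c(14, 7, 6, 19, 3)

# Compute a y-axis range that covers both series (illustrative name).
y_range <- range(c(v, t))

# Draw the first line with the shared range, then add the second line.
plot(v, type = "o", col = "red", ylim = y_range,
     xlab = "Month", ylab = "Rain fall", main = "Rain fall chart")
lines(t, type = "o", col = "blue")
```

With this data the two ranges happen to coincide, but the same pattern keeps the second line fully visible when they do not.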
Pie Charts

A pie chart is a representation of values as
slices of a circle with different colors. The
slices are labeled, and the numbers
corresponding to each slice are also
represented in the chart.
In R the pie chart is created using
the pie() function, which takes positive
numbers as a vector input.
Syntax
pie(x, labels, radius, main, col, clockwise)
x is a vector containing the numeric values used in
the pie chart.
labels is used to give description to the slices.
radius indicates the radius of the circle of the pie
chart (a value between −1 and +1).
main indicates the title of the chart.
col indicates the color palette.
clockwise is a logical value indicating if the slices
are drawn clockwise or anticlockwise.
# Create data for the graph.
x <- c(21, 62, 10, 53)
labels <- c("London", "New York", "Singapore", "Mumbai")
# Plot the chart.
pie(x, labels)
Pie Chart Title and Colors
We can expand the features of the chart by
adding more parameters to the function. We
will use the parameter main to add a title to the
chart, and the parameter col, which will
make use of the rainbow color palette while
drawing the chart. The length of the palette
should be the same as the number of values we
have for the chart. Hence we use length(x).
# Create data for the graph.
x <- c(21, 62, 10, 53)
labels <- c("London", "New York", "Singapore", "Mumbai")
# Plot the chart with title and rainbow color palette.
pie(x, labels, main = "City pie chart",
    col = rainbow(length(x)))
Slice Percentages and Chart Legend
# Create data for the graph.
x <- c(21, 62, 10, 53)
labels <- c("London", "New York", "Singapore", "Mumbai")
# Percentage of the total for each slice, rounded to one decimal place.
piepercent <- round(100*x/sum(x), 1)
# Give the chart file a name.
png(file = "city_percentage_legends.png")
# Plot the chart with the percentages as slice labels.
pie(x, labels = piepercent, main = "City pie chart",
    col = rainbow(length(x)))
# Add a legend that maps each color to its city.
legend("topright", labels, cex = 0.8,
       fill = rainbow(length(x)))
# Save the file.
dev.off()
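As a possible variation, the city name and its percentage can be combined into one slice label, so the chart can be read without a legend. This is a sketch using the same data; the name slice_labels is illustrative and not part of the original example.

```r
# Same data as in the example.
x <- c(21, 62, 10, 53)
labels <- c("London", "New York", "Singapore", "Mumbai")

# Percentage of the total for each slice, rounded to one decimal place.
piepercent <- round(100 * x / sum(x), 1)

# Combine name and percentage into one label per slice (illustrative name).
slice_labels <- paste0(labels, " (", piepercent, "%)")

# Plot the chart with the combined labels.
pie(x, labels = slice_labels, main = "City pie chart",
    col = rainbow(length(x)))
```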
3D Pie Chart
A pie chart with 3 dimensions can be drawn using additional
packages. The package plotrix has a function called pie3D() that is
used for this.
# Get the library.
library(plotrix)
# Create data for the graph.
x <- c(21, 62, 10, 53)
lbl <- c("London", "New York", "Singapore", "Mumbai")
# Give the chart file a name.
png(file = "3d_pie_chart.png")
# Plot the chart.
pie3D(x, labels = lbl, explode = 0.1, main = "Pie Chart of Countries")
# Save the file.
dev.off()
R Bar Charts
A bar chart is a pictorial representation in which
numerical values of variables are represented by the length
or height of rectangles of equal width. A bar
chart is used for summarizing a set of categorical data:
the data is shown through rectangular bars, with the
length of each bar proportional to the value of the
variable. R uses the function barplot() to create bar
charts. R can draw both vertical and horizontal bars,
and each of the bars can be given a different color.
Syntax
The basic syntax to create a bar chart in R is −
barplot(H, xlab, ylab, main, names.arg, col)
Following is the description of the parameters used −
1.H : A vector or matrix which contains
numeric values used in the bar chart.
2.xlab : A label for the x-axis.
3.ylab : A label for the y-axis.
4.main : A title of the bar chart.
5.names.arg : A vector of names that appear
under each bar.
6.col : It is used to give colors to the bars in
the graph.
# Creating the data for Bar chart
H<- c(12,35,54,3,41)
# Giving the chart file a name
png(file = "bar_chart.png")
# Plotting the bar chart
barplot(H)
# Saving the file
dev.off()
Labels, Title & Colors:
Like pie charts, we can add more
features to the bar chart by passing
more arguments to the barplot() function.
We can add a title to the bar chart or add
colors to the bars by adding the main and col
parameters, respectively. We can add another
parameter, names.arg, which is a vector
with the same number of values as the
input vector, describing the
meaning of each bar.
 Let's see an example to understand how labels, titles, and colors are
added in our bar chart.
 # Creating the data for Bar chart
 H <- c(12,35,54,3,41)
 M<- c("Feb","Mar","Apr","May","Jun")

 # Giving the chart file a name
 png(file = "bar_properties.png")

 # Plotting the bar chart
 barplot(H,names.arg=M,xlab="Month",ylab="Revenue",col="Green",
 main="Revenue Bar chart",border="red")
 # Saving the file
 dev.off()
Group Bar Chart & Stacked Bar Chart
We can create bar charts with groups of bars
or stacked bars by using a matrix as the input.
Each row of the matrix represents one
variable. By default, barplot() stacks the rows
of the matrix within each bar.
 Let's see an example to understand how these charts are created.
months <- c("Jan","Feb","Mar","Apr","May")
regions <- c("West","North","South")
# Creating the matrix of the values (one row per region).
Values <- matrix(c(21,32,33,14,95,46,67,78,39,11,22,23,94,15,16),
                 nrow = 3, ncol = 5, byrow = TRUE)
# Giving the chart file a name
png(file = "stacked_chart.png")
# Creating the stacked bar chart
barplot(Values, main = "Total Revenue", names.arg = months, xlab = "Month",
        ylab = "Revenue", col = c("cadetblue3","deeppink2","goldenrod1"))
# Adding the legend to the chart
legend("topleft", regions, cex = 1.3,
       fill = c("cadetblue3","deeppink2","goldenrod1"))

 # Saving the file
 dev.off()
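The example above produces a stacked chart. A grouped chart can be sketched from the same matrix by passing beside = TRUE to barplot(), which draws the rows side by side within each month instead of stacking them. The names bar_colors, bar_mids and monthly_totals are illustrative.

```r
months <- c("Jan", "Feb", "Mar", "Apr", "May")
regions <- c("West", "North", "South")

# Same matrix as in the stacked example (one row per region).
Values <- matrix(c(21, 32, 33, 14, 95, 46, 67, 78, 39, 11, 22, 23, 94, 15, 16),
                 nrow = 3, ncol = 5, byrow = TRUE)
bar_colors <- c("cadetblue3", "deeppink2", "goldenrod1")

# beside = TRUE draws one bar per region next to each other for every month.
# barplot() returns the bar midpoints, here one row per region.
bar_mids <- barplot(Values, beside = TRUE, main = "Total Revenue",
                    names.arg = months, xlab = "Month", ylab = "Revenue",
                    col = bar_colors)
legend("topleft", regions, cex = 0.8, fill = bar_colors)

# The height of each stacked bar equals the column sum of the matrix.
monthly_totals <- colSums(Values)
print(monthly_totals)
```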
R Boxplot
A boxplot summarizes how the values in a data set
are distributed. It is based on the quartiles, which
divide the data set into four parts. The graph shows the
minimum, maximum, median, first quartile, and
third quartile of the data set, with outliers drawn
as separate points. Boxplots are also
useful for comparing the distributions of several
data sets by drawing a boxplot for each of them.
R provides a boxplot() function to create a boxplot.
There is the following syntax of boxplot() function:
boxplot(x, data, notch, varwidth, names, main)
1.x : It is a vector or a formula.
2.data : It is the data frame.
3. notch : It is a logical value set as true to
draw a notch.
4.varwidth : It is also a logical value; set to
TRUE to draw the width of each box proportional
to the square root of the sample size.
5.names : It is the group of labels that will be
printed under each boxplot.
6.main : It is used to give a title to the graph.
Example
# Giving a name to the chart file.
png(file = "boxplot.png")
# Plotting the chart.
boxplot(mpg ~ cyl, data = mtcars, xlab = "Quantity of Cylinders",
        ylab = "Miles Per Gallon", main = "R Boxplot Example")

# Save the file.
dev.off()
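To see the numbers behind each box, boxplot() can be called with plot = FALSE, in which case it returns the summary statistics without drawing. The sketch below uses the same mtcars formula. The row labels are added here for readability and are not produced by R; note that R reports hinges, which are close to the first and third quartiles but not always identical to them.

```r
# Compute the boxplot statistics without drawing the chart.
b <- boxplot(mpg ~ cyl, data = mtcars, plot = FALSE)

# b$stats has one column per group and five rows per column.
five_num <- b$stats
rownames(five_num) <- c("lower whisker", "lower hinge", "median",
                        "upper hinge", "upper whisker")
colnames(five_num) <- b$names
print(five_num)

# Number of observations in each group.
print(b$n)
```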
Boxplot using notch
In R, we can draw a boxplot using a notch. It
helps us to find out how the medians of
different data groups match with each other.
Let's see an example to understand how a
boxplot graph is created using notch for each
of the groups.
 Example
 # Giving a name to our chart.
 png(file = "boxplot_using_notch.png")
 # Plotting the chart.
 boxplot(mpg ~ cyl, data = mtcars,
 xlab = "Quantity of Cylinders",
 ylab = "Miles Per Gallon",
 main = "Boxplot Example",
 notch = TRUE,
 varwidth = TRUE,
 col = c("green","yellow","red"),
 names = c("High","Medium","Low")
)
 # Saving the file.
 dev.off()
Violin Plots
R provides an additional plotting scheme
which is created with the combination of
a boxplot and a kernel density plot. The
violin plots are created with the help of
vioplot() function present in the vioplot
package.
Let's see an example to understand the
creation of the violin plot.
 # Loading the vioplot package
 library(vioplot)
 # Giving a name to our chart.
 png(file = "vioplot.png")
 #Creating data for vioplot function
 x1 <- mtcars$mpg[mtcars$cyl==4]
 x2 <- mtcars$mpg[mtcars$cyl==6]
 x3 <- mtcars$mpg[mtcars$cyl==8]
 #Creating vioplot function
 vioplot(x1, x2, x3, names=c("4 cyl", "6 cyl", "8 cyl"),
 col="green")
 #Setting title
 title("Violin plot example")
 # Saving the file.
 dev.off()
Bagplot: 2-Dimensional Boxplot Extension
The bagplot(x, y) function in
the aplpack package provides a bivariate
version of the univariate boxplot. The bag
contains 50% of all points. The bivariate
median is approximated. The fence separates
the points inside it from the points outside,
and the outliers are displayed.
Example
# Loading aplpack package
library(aplpack)
# Giving a name to our chart.
png(file = "bagplot.png")
# Creating the bagplot
bagplot(mtcars$wt, mtcars$mpg, xlab = "Car Weight", ylab = "Miles Per Gallon",
        main = "2D Boxplot Extension")
# Saving the file.
dev.off()
R Histogram
A histogram is a type of bar chart that shows how many
values fall into each of a set of value ranges (bins).
A histogram is used to show the distribution of a single
variable, whereas a bar chart is used for comparing
different entities. In the histogram, the height of each bar
represents the number of values present in the given
range.
For creating a histogram, R provides hist() function,
which takes a vector as an input and uses more
parameters to add more functionality. There is the
following syntax of hist() function:
hist(v,main,xlab,ylab,xlim,ylim,breaks,col,border)
1.v : It is a vector that contains numeric values.
2.main : It indicates the title of the chart.
3.col : It is used to set the color of the bars.
4.border : It is used to set the border color of each
bar.
5.xlab : It is used to describe the x-axis.
6.ylab : It is used to describe the y-axis.
7.xlim : It is used to specify the range of values on
the x-axis.
8.ylim : It is used to specify the range of values on
the y-axis.
9.breaks : It is used to suggest the number of bins (a
single number) or to give the break points between
bins (a vector).
 Example
 # Creating data for the graph.
 v <- c(12,24,16,38,21,13,55,17,39,10,60)

 # Giving a name to the chart file.
 png(file = "histogram_chart.png")

 # Creating the histogram.
hist(v, xlab = "Weight", ylab = "Frequency", col = "green", border = "red")

 # Saving the file.
 dev.off()
 Example: Use of xlim & ylim parameter
 # Creating data for the graph.
 v <- c(12,24,16,38,21,13,55,17,39,10,60)

 # Giving a name to the chart file.
 png(file = "histogram_chart_lim.png")

 # Creating the histogram.
hist(v, xlab = "Weight", ylab = "Frequency", col = "green", border = "red",
     xlim = c(0,40), ylim = c(0,3), breaks = 5)

 # Saving the file.
 dev.off()
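When breaks is a single number, R treats it only as a suggestion and picks "pretty" break points itself. For exact control, a vector of break points can be given instead. The sketch below uses the same data with hypothetical bins of width 10, and calls hist() with plot = FALSE to inspect the bins and counts without drawing.

```r
# Same data as in the examples.
v <- c(12, 24, 16, 38, 21, 13, 55, 17, 39, 10, 60)

# Explicit break points: bins of width 10 from 10 to 60 (illustrative choice).
h <- hist(v, breaks = seq(10, 60, by = 10), plot = FALSE)

# Bin boundaries and the number of values in each bin.
print(h$breaks)
print(h$counts)
```

Passing the same breaks vector to a normal hist() call draws the chart with exactly these bins.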
R Scatterplots
Scatter plots are used to compare variables. A
comparison between variables is required when
we need to see how strongly one variable is
related to another. In a scatterplot, the
data is represented as a collection of points. Each
point on the scatterplot defines the values of the
two variables. One variable is selected for the
vertical axis and the other for the horizontal axis. In R,
there are two ways of creating a scatterplot:
using the plot() function and using the ggplot2
package's functions.
 There is the following syntax for creating scatterplot in R:
 plot(x, y, main, xlab, ylab, xlim, ylim, axes)
 1.x : It is the dataset whose values are the horizontal
coordinates.
 2.y : It is the dataset whose values are the vertical coordinates.
 3.main : It is the title of the graph.
 4.xlab : It is the label on the horizontal axis.
 5.ylab : It is the label on the vertical axis.
 6.xlim : It is the limits of the x values which is used for plotting.
 7.ylim : It is the limits of the values of y, which is used for
plotting.
 8.axes : It indicates whether both axes should be drawn on the
plot.
Example
#Fetching two columns from mtcars
data <-mtcars[,c('wt','mpg')]
# Giving a name to the chart file.
png(file = "scatterplot.png")
# Plotting the chart for cars with weight between 2.5 and 5
# and mileage between 15 and 30.
plot(x = data$wt, y = data$mpg, xlab = "Weight", ylab = "Mileage",
     xlim = c(2.5,5), ylim = c(15,30), main = "Weight vs Mileage")
# Saving the file.
dev.off()
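To put a number on the relationship that the scatterplot shows, one option is to compute the correlation and overlay a fitted straight line. This is a sketch using base R's cor(), lm() and abline() on the same two mtcars columns. The linear model is an assumption made for illustration, and the names r and fit are illustrative.

```r
# Same two columns as in the example.
data <- mtcars[, c("wt", "mpg")]

# Correlation between weight and mileage (negative: heavier cars get lower mpg).
r <- cor(data$wt, data$mpg)
print(r)

# Fit a straight line mpg = a + b * wt.
fit <- lm(mpg ~ wt, data = data)

# Draw the scatterplot and add the fitted line on top.
plot(x = data$wt, y = data$mpg, xlab = "Weight", ylab = "Mileage",
     main = "Weight vs Mileage")
abline(fit, col = "red")
```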
Scatterplot using ggplot2
The ggplot2 package provides the ggplot() and
geom_point() functions for creating a
scatterplot. The ggplot() function takes a
series of inputs. The first parameter is
the input data frame, and the second is the aes()
function, in which we map variables to the
x-axis and y-axis.
#Loading ggplot2 package
library(ggplot2)
# Giving a name to the chart file.
png(file = "scatterplot_ggplot.png")
# Plotting the chart using ggplot() and geom_point() functions.
ggplot(mtcars, aes(x = drat, y = mpg)) + geom_point()
# Saving the file.
dev.off()
