Exploratory data analysis

Exploratory Data Analysis (EDA) is an approach to analyzing data using visual techniques. It is used to discover trends and patterns, or to check assumptions, with the help of statistical summaries and graphical representations.

R Line Charts

A line chart is a graph that connects a series of points by drawing line segments between them. The points are ordered by one of their coordinates (usually the x-coordinate). The plot() function in R is used to create the line graph.

Syntax

The basic syntax to create a line chart in R is:

    plot(v, type, col, xlab, ylab, main)

v is a vector containing the numeric values.
type takes the value "p" to draw only the points, "l" to draw only the lines and "o" to draw both points and lines.
xlab is the label for the x axis.
ylab is the label for the y axis.
main is the title of the chart.
col is used to give colors to both the points and lines.

Example

A simple line chart is created using the input vector and the type parameter set to "o".

    # Create the data for the chart.
    v <- c(7, 12, 28, 3, 41)

    # Plot the line chart.
    plot(v, type = "o")

Line Chart Title, Color and Labels

The features of the line chart can be expanded by using additional parameters. We add color to the points and lines, give a title to the chart and add labels to the axes.

    # Create the data for the chart.
    v <- c(7, 12, 28, 3, 41)

    # Plot the line chart.
    plot(v, type = "o", col = "red", xlab = "Month", ylab = "Rain fall",
         main = "Rain fall chart")

Multiple Lines in a Line Chart
More than one line can be drawn on the same chart by using the lines() function. After the first line is plotted, the lines() function can use an additional vector as input to draw the second line in the chart.

    # Create the data for the chart.
    v <- c(7, 12, 28, 3, 41)
    t <- c(14, 7, 6, 19, 3)

    # Plot the first line.
    plot(v, type = "o", col = "red", xlab = "Month", ylab = "Rain fall",
         main = "Rain fall chart")

    # Add the second line to the same chart.
    lines(t, type = "o", col = "blue")

Pie Charts
A pie chart is a representation of values as slices of a circle with different colors. The slices are labeled, and the number corresponding to each slice is also shown in the chart. In R the pie chart is created using the pie() function, which takes positive numbers as a vector input.

Syntax

    pie(x, labels, radius, main, col, clockwise)

x is a vector containing the numeric values used in the pie chart.
labels is used to give a description to the slices.
radius indicates the radius of the circle of the pie chart (a value between -1 and +1).
main indicates the title of the chart.
col indicates the color palette.
clockwise is a logical value indicating if the slices are drawn clockwise or anti-clockwise.

    # Create data for the graph.
    x <- c(21, 62, 10, 53)
    labels <- c("London", "New York", "Singapore", "Mumbai")

    # Plot the chart.
    pie(x, labels)

Pie Chart Title and Colors

We can expand the features of the chart by adding more parameters to the function. We will use the parameter main to add a title to the chart, and the parameter col, which will make use of the rainbow color palette while drawing the chart. The length of the palette should be the same as the number of values we have for the chart. Hence we use length(x).

    # Create data for the graph.
    x <- c(21, 62, 10, 53)
    labels <- c("London", "New York", "Singapore", "Mumbai")

    # Plot the chart with title and rainbow color palette.
    pie(x, labels, main = "City pie chart", col = rainbow(length(x)))

Slice Percentages and Chart Legend

    # Create data for the graph.
    x <- c(21, 62, 10, 53)
    labels <- c("London", "New York", "Singapore", "Mumbai")
    piepercent <- round(100 * x / sum(x), 1)

    # Give the chart file a name.
    png(file = "city_percentage_legends.png")

    # Plot the chart, labelling each slice with its percentage.
    pie(x, labels = piepercent, main = "City pie chart", col = rainbow(length(x)))
    legend("topright", labels, cex = 0.8, fill = rainbow(length(x)))

    # Save the file.
    dev.off()

3D Pie Chart

A pie chart with 3 dimensions can be drawn using additional packages.
The package plotrix has a function called pie3D() that is used for this.

    # Get the library.
    library(plotrix)

    # Create data for the graph.
    x <- c(21, 62, 10, 53)
    lbl <- c("London", "New York", "Singapore", "Mumbai")

    # Give the chart file a name.
    png(file = "3d_pie_chart.png")

    # Plot the chart.
    pie3D(x, labels = lbl, explode = 0.1, main = "Pie Chart of Cities")

    # Save the file.
    dev.off()

R Bar Charts

A bar chart is a pictorial representation in which the numerical values of variables are represented by the length or height of rectangles of equal width. It is used for summarizing a set of categorical data: each rectangular bar has a length proportional to the value of the variable. R uses the function barplot() to create bar charts. R can draw both vertical and horizontal bars, and each of the bars can be given a different color.

Syntax

The basic syntax to create a bar chart in R is:

    barplot(H, xlab, ylab, main, names.arg, col)

Following is the description of the parameters used:

1. H : A vector or matrix which contains the numeric values used in the bar chart.
2. xlab : A label for the x-axis.
3. ylab : A label for the y-axis.
4. main : A title of the bar chart.
5. names.arg : A vector of names that appear under each bar.
6. col : It is used to give colors to the bars in the graph.
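The text above mentions that barplot() can draw horizontal as well as vertical bars. A minimal sketch of the horizontal case, assuming base R's `horiz` argument to barplot() and `las` for label orientation; the revenue values and month labels are illustrative only:

```r
# Illustrative data for the sketch (hypothetical revenue per month).
H <- c(12, 35, 54, 3, 41)
M <- c("Feb", "Mar", "Apr", "May", "Jun")

# horiz = TRUE draws the bars left to right instead of bottom to top.
# las = 1 keeps the axis labels horizontal so the month names stay readable.
# Because the bars are horizontal, the value axis is now the x-axis.
mid <- barplot(H, names.arg = M, horiz = TRUE, las = 1,
               xlab = "Revenue", ylab = "Month",
               col = "steelblue", main = "Revenue Bar chart (horizontal)")

# barplot() returns the midpoint of each bar along the category axis,
# one number per value in H; these can be used to position extra labels.
print(mid)
```

Note that with horizontal bars the meaning of xlab and ylab swaps: the numeric scale runs along the x-axis and the categories along the y-axis.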
Example

    # Creating the data for Bar chart
    H <- c(12, 35, 54, 3, 41)

    # Giving the chart file a name
    png(file = "bar_chart.png")

    # Plotting the bar chart
    barplot(H)

    # Saving the file
    dev.off()

Labels, Title & Colors

Like pie charts, we can add more functionality to the bar chart by passing more arguments to the barplot() function. We can add a title to our bar chart, or add colors to the bars, by using the main and col parameters, respectively. We can add another parameter, names.arg, which is a vector with the same number of values as the input vector and describes the meaning of each bar. Let's see an example to understand how labels, titles, and colors are added to our bar chart.

    # Creating the data for Bar chart
    H <- c(12, 35, 54, 3, 41)
    M <- c("Feb", "Mar", "Apr", "May", "Jun")

    # Giving the chart file a name
    png(file = "bar_properties.png")

    # Plotting the bar chart
    barplot(H, names.arg = M, xlab = "Month", ylab = "Revenue", col = "green",
            main = "Revenue Bar chart", border = "red")

    # Saving the file
    dev.off()

Group Bar Chart & Stacked Bar Chart

We can create bar charts with groups of bars or with stacked bars by passing a matrix as the input. Each row of the matrix is one variable (here, a region) and each column is one bar position (here, a month). By default barplot() stacks the rows within each bar; passing beside = TRUE draws them side by side as a group instead. Let's see an example to understand how these charts are created.

    months <- c("Jan", "Feb", "Mar", "Apr", "May")
    regions <- c("West", "North", "South")

    # Creating the matrix of the values.
    Values <- matrix(c(21, 32, 33, 14, 95, 46, 67, 78, 39, 11, 22, 23, 94, 15, 16),
                     nrow = 3, ncol = 5, byrow = TRUE)

    # Giving the chart file a name
    png(file = "stacked_chart.png")

    # Creating the bar chart
    barplot(Values, main = "Total Revenue", names.arg = months,
            xlab = "Month", ylab = "Revenue",
            col = c("cadetblue3", "deeppink2", "goldenrod1"))

    # Adding the legend to the chart
    legend("topleft", regions, cex = 1.3,
           fill = c("cadetblue3", "deeppink2", "goldenrod1"))

    # Saving the file
    dev.off()

R Boxplot

Boxplots show how the values in a data set are distributed. A boxplot divides the data set into quartiles and represents the minimum, maximum, median, first quartile, and third quartile of the data set. Boxplots are also useful for comparing the distribution of data across several data sets by drawing a boxplot for each of them.

R provides a boxplot() function to create a boxplot. The boxplot() function has the following syntax:

    boxplot(x, data, notch, varwidth, names, main)

1. x : It is a vector or a formula.
2. data : It is the data frame.
3. notch : It is a logical value set as true to draw a notch.
4. varwidth : It is also a logical value; set as true, it draws the width of each box proportional to the square root of the sample size of its group.
5. names : It is the group of labels that will be printed under each boxplot.
6. main : It is used to give a title to the graph.

Example

    # Giving a name to the chart file.
    png(file = "boxplot.png")

    # Plotting the chart.
    boxplot(mpg ~ cyl, data = mtcars, xlab = "Quantity of Cylinders",
            ylab = "Miles Per Gallon", main = "R Boxplot Example")

    # Save the file.
    dev.off()

Boxplot using notch

In R, we can draw a boxplot using a notch. It helps us to find out how the medians of different data groups match with each other. Let's see an example to understand how a boxplot graph is created using a notch for each of the groups.

Example

    # Giving a name to our chart.
    png(file = "boxplot_using_notch.png")

    # Plotting the chart.
    boxplot(mpg ~ cyl, data = mtcars,
            xlab = "Quantity of Cylinders",
            ylab = "Miles Per Gallon",
            main = "Boxplot Example",
            notch = TRUE,
            varwidth = TRUE,
            col = c("green", "yellow", "red"),
            names = c("High", "Medium", "Low"))

    # Saving the file.
    dev.off()

Violin Plots

A violin plot combines a boxplot with a kernel density plot. Violin plots are created with the vioplot() function from the vioplot package. Let's see an example to understand the creation of the violin plot.

    # Loading the vioplot package
    library(vioplot)

    # Giving a name to our chart.
    png(file = "vioplot.png")

    # Creating data for vioplot function
    x1 <- mtcars$mpg[mtcars$cyl == 4]
    x2 <- mtcars$mpg[mtcars$cyl == 6]
    x3 <- mtcars$mpg[mtcars$cyl == 8]

    # Creating the violin plot
    vioplot(x1, x2, x3, names = c("4 cyl", "6 cyl", "8 cyl"), col = "green")

    # Setting title
    title("Violin plot example")

    # Saving the file.
    dev.off()

Bagplot: 2-Dimensional Boxplot Extension

The bagplot(x, y) function in the aplpack package provides a bivariate version of the univariate boxplot. The bag contains 50% of all points. The bivariate median is approximated. The fence separates the points inside it from the outliers, and the outliers are displayed.

Example

    # Loading aplpack package
    library(aplpack)

    # Giving a name to our chart.
    png(file = "bagplot.png")

    # Creating the bagplot
    bagplot(mtcars$wt, mtcars$mpg, xlab = "Car Weight", ylab = "Miles Per Gallon",
            main = "2D Boxplot Extension")

    # Saving the file.
    dev.off()

R Histogram

A histogram is a type of bar chart which shows how many values fall into each of a set of value ranges. The histogram is used to show a distribution, whereas a bar chart is used for comparing different entities. In the histogram, the height of each bar represents the number of values present in the given range.
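To make "the number of values present in the given range" concrete, the sketch below asks R's hist() function for the bins without drawing them, using its `plot = FALSE` argument. The vector is the same one used in this document's histogram examples; the `bins` data frame is a name introduced here for illustration.

```r
# Sample values (the vector from the histogram examples in this document).
v <- c(12, 24, 16, 38, 21, 13, 55, 17, 39, 10, 60)

# plot = FALSE computes the histogram object without drawing anything.
h <- hist(v, plot = FALSE)

# h$breaks holds the bin edges and h$counts the number of values per bin.
# There is always one more edge than there are bins.
bins <- data.frame(
  from  = head(h$breaks, -1),  # lower edge of each bin
  to    = tail(h$breaks, -1),  # upper edge of each bin
  count = h$counts             # bar height in the drawn histogram
)
print(bins)

# Every value falls into exactly one bin, so the counts add up to length(v).
print(sum(bins$count) == length(v))
```

Each row of `bins` corresponds to one bar of the histogram that hist(v) would draw.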
For creating a histogram, R provides the hist() function, which takes a vector as an input and uses more parameters to add more functionality. The hist() function has the following syntax:

    hist(v, main, xlab, ylab, xlim, ylim, breaks, col, border)

1. v : It is a vector that contains numeric values.
2. main : It indicates the title of the chart.
3. col : It is used to set the color of the bars.
4. border : It is used to set the border color of each bar.
5. xlab : It is used to describe the x-axis.
6. ylab : It is used to describe the y-axis.
7. xlim : It is used to specify the range of values on the x-axis.
8. ylim : It is used to specify the range of values on the y-axis.
9. breaks : It is used to specify the breakpoints between the bars, or a suggested number of bars, which determines the width of each bar.

Example

    # Creating data for the graph.
    v <- c(12, 24, 16, 38, 21, 13, 55, 17, 39, 10, 60)

    # Giving a name to the chart file.
    png(file = "histogram_chart.png")

    # Creating the histogram.
    hist(v, xlab = "Weight", ylab = "Frequency", col = "green", border = "red")

    # Saving the file.
    dev.off()

Example: Use of xlim & ylim parameter

Limits that are narrower than the data, as in this example, hide the parts of the bars that fall outside them.

    # Creating data for the graph.
    v <- c(12, 24, 16, 38, 21, 13, 55, 17, 39, 10, 60)

    # Giving a name to the chart file.
    png(file = "histogram_chart_lim.png")

    # Creating the histogram.
    hist(v, xlab = "Weight", ylab = "Frequency", col = "green", border = "red",
         xlim = c(0, 40), ylim = c(0, 3), breaks = 5)

    # Saving the file.
    dev.off()

R Scatterplots

Scatter plots are used to compare variables. A comparison between variables is required when we need to determine how much one variable is affected by another variable. In a scatterplot, the data is represented as a collection of points. Each point on the scatterplot defines the values of the two variables. One variable is selected for the vertical axis and the other for the horizontal axis. In R, there are two ways of creating a scatterplot: using the plot() function and using the ggplot2 package's functions.
The syntax for creating a scatterplot in R is:

    plot(x, y, main, xlab, ylab, xlim, ylim, axes)

1. x : It is the dataset whose values are the horizontal coordinates.
2. y : It is the dataset whose values are the vertical coordinates.
3. main : It is the title of the graph.
4. xlab : It is the label on the horizontal axis.
5. ylab : It is the label on the vertical axis.
6. xlim : It is the limits of the x values used for plotting.
7. ylim : It is the limits of the y values used for plotting.
8. axes : It indicates whether both axes should be drawn on the plot.

Example

    # Fetching two columns from mtcars
    data <- mtcars[, c("wt", "mpg")]

    # Giving a name to the chart file.
    png(file = "scatterplot.png")

    # Plotting the chart for cars with weight between 2.5 and 5
    # and mileage between 15 and 30.
    plot(x = data$wt, y = data$mpg, xlab = "Weight", ylab = "Mileage",
         xlim = c(2.5, 5), ylim = c(15, 30), main = "Weight vs Mileage")

    # Saving the file.
    dev.off()

Scatterplot using ggplot2

The ggplot2 package provides the ggplot() and geom_point() functions for creating a scatterplot. The first argument of ggplot() is the input data frame, and the second is a call to the aes() function, in which we map variables to the x-axis and y-axis.

    # Loading ggplot2 package
    library(ggplot2)

    # Giving a name to the chart file.
    png(file = "scatterplot_ggplot.png")

    # Plotting the chart using ggplot() and geom_point() functions.
    ggplot(mtcars, aes(x = drat, y = mpg)) + geom_point()

    # Saving the file.
    dev.off()
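In the base plot() scatterplot example, xlim and ylim only set the visible axis range: the data itself is not filtered, and points that fall outside the plotting region are simply not drawn. A minimal sketch, using the same limits, of how to select the cars that lie inside that window; the names `wt_mpg`, `inside` and `visible` are introduced here for illustration.

```r
# Two columns from the built-in mtcars data set, as in the scatterplot example.
wt_mpg <- mtcars[, c("wt", "mpg")]

# Logical vector: TRUE for cars whose weight and mileage are both
# within the limits passed as xlim and ylim in the example.
inside <- wt_mpg$wt  >= 2.5 & wt_mpg$wt  <= 5 &
          wt_mpg$mpg >= 15  & wt_mpg$mpg <= 30

# Keep only the rows that satisfy both conditions.
visible <- wt_mpg[inside, ]

# Report how many of the cars fall inside the window.
cat(nrow(visible), "of", nrow(wt_mpg), "cars fall inside the limits\n")
```

Filtering first makes the selection explicit in the data, which is useful when the same subset is needed for a summary as well as for the plot.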