
Shamsundar M2 Project2

This document provides instructions for creating various data visualizations using the BullTroutRML2 dataset. It includes 20 steps to plot scatter plots, histograms, density plots with shading, and plots with symbols and colors representing different categories. Key aspects covered are summarizing the dataset, filtering it to a specific location, and plotting the variables to show their relationships and distributions. The visualizations help analyze trends, outliers, and how points are distributed. Completing the steps builds skills in exploring and visualizing data in R.

Module 2

ALY6000 Introduction to Analytics

Vaishnavi Shamsundar

10/03/2020

Executive Summary Report 2



Key findings about the data


1) Print your name at the top of the script. Include the prefix “Plotting Basics:” such that it
appears as “Plotting Basics: Lastname”

2) Import libraries including: FSA, FSAdata, magrittr, dplyr, plotrix, ggplot2, and moments
NOTE: You must use R version 3.6.3 to gain access to the FSA data set. If you installed a
later version of R, you must uninstall RStudio and R, then reinstall R version 3.6.3, and then
reinstall RStudio.

3) Load the BullTroutRML2 dataset (BullTroutRML2.csv) NOTE: The dataset is already
imported into your project when you added the FSA and FSAdata libraries.

4) Print the first and last 3 records from the BullTroutRML2 dataset

5) Remove all records except those from Harrison Lake

6) Display the first and last 5 records from the filtered BullTroutRML2 dataset

7) Display the structure of the filtered BullTroutRML2 dataset



8) Display the summary of the filtered BullTroutRML2 dataset

9) Create a scatterplot for “age” (y variable) and “fl” (x variable) with the following
specifications:
• Limit of x axis is (0,500)
• Limit of y axis is (0,15)
• Title of graph is “Plot 1: Harrison Lake Trout”
• Y axis label is “Age (yrs)”
• X axis label is “Fork Length (mm)”
• Use a small filled circle for the plotted data points

10) Plot an “Age” histogram with the following specifications:
• Y axis label is “Frequency”
• X axis label is “Age (yrs)”
• Title of the histogram is “Plot 2: Harrison Fish Age Distribution”
• X and Y axis limits are 0, 15
• The color of the frequency plots is “cadetblue”
• The color of the title is “cadetblue”

11) Create an overdense plot using the same specifications as the previous scatterplot, but:
• Title the plot “Plot 3: Harrison Density Shaded by Era”
• Y axis label is “Age (yrs)”
• Y axis limits are 0 to 15
• X axis label is “Fork Length (mm)”
• X axis limits are 0 to 500
• Include two levels of shading for the “green” data points
• Plot solid circles as data points

12) Create a new object called “tmp” that includes the first 3 and last 3 records of the
BullTroutRML2 data set

13) Display the “era” column (variable) in the new “tmp” object

14) Create a pchs vector with the argument values for + and x

15) Create a cols vector with the two elements “red” and “gray60”

16) Convert the tmp era values to numeric values.

17) Initialize the cols vector with the tmp era values

18) Create a plot of “Age (yrs)” (y variable) versus “Fork Length (mm)” (x variable) with the
following specifications:
• Title of graph is “Plot 4: Symbol & Color by Era”
• Limit of x axis is (0,500)
• Limit of y axis is (0,15)
• X axis label is “Fork Length (mm)”
• Y axis label is “Age (yrs)”
• Set pch equal to pchs era values
• Set col equal to cols era values

19) Plot a regression line overlay on Plot 4 and title the new graph “Plot 5: Regression
Overlay”

20) Place a legend on Plot 5 and call the new graph “Plot 6: Legend Overlay”
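
Steps 14 to 18 rely on one idiom: a factor is converted to its numeric level codes, and those codes index a lookup vector of symbols or colors. The sketch below illustrates that idiom under stated assumptions. The data frame `toy` and the names `era_code`, `point_pch`, `point_cols`, and `out_file` are hypothetical, the values are invented rather than taken from BullTroutRML2, and the plot is written to a temporary PDF so that the sketch also runs without a display.

```r
# Hypothetical toy data with the same column names as BullTroutRML2;
# the values are invented for illustration only.
toy <- data.frame(
  age = c(3, 5, 8, 4, 9, 12),
  fl  = c(150, 260, 380, 200, 400, 450),
  era = factor(c("1977-80", "1977-80", "1977-80",
                 "1997-01", "1997-01", "1997-01"))
)

pchs <- c(3, 4)             # pch 3 draws "+", pch 4 draws "x"
cols <- c("red", "gray60")  # one color per era level

# as.numeric() on a factor returns the level codes (1, 2, ...),
# which can index the lookup vectors directly.
era_code   <- as.numeric(toy$era)
point_pch  <- pchs[era_code]
point_cols <- cols[era_code]

# Draw to a temporary file so the example also runs non-interactively.
out_file <- tempfile(fileext = ".pdf")
pdf(out_file)
plot(age ~ fl, data = toy,
     xlim = c(0, 500), ylim = c(0, 15),
     xlab = "Fork Length (mm)", ylab = "Age (yrs)",
     pch = point_pch, col = point_cols,
     main = "Symbol & Color by Era (toy data)")
legend("topleft", legend = levels(toy$era), pch = pchs, col = cols)
dev.off()
```

Indexing by level code keeps the symbol and color assignment tied to the factor levels, so the legend can reuse `pchs` and `cols` unchanged.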

The three aspects of the assignments


A) The summary gives the minimum, first quartile, median, mean, third quartile, and maximum
of each variable. For the BullTroutRML2 dataset, the mean age is 5.771 and the mean fork
length (fl) is 326.1; the median age is 6 and the median fork length is 352.5. The values
reported as skewness are 2.92 for age and 112.20 for fl. A value of 112.20 is too large to be
a skewness coefficient for a sample of this size, so these two figures more plausibly
correspond to the standard deviations. The standard deviation, variance, and kurtosis values
are shown in the screenshots below. Because the mean lies below the median for both age and
fl, both distributions are asymmetric, with the longer tail below the median.
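
To make the link between the mean, the median, and the sign of the skewness concrete, the sketch below computes a moment-based skewness in base R. It rests on assumptions: the function `sample_skewness` and the two vectors are hypothetical and invented for illustration, and the formula shown (third central moment divided by the second central moment raised to 3/2) is, to our understanding, the same ratio that `skewness()` from the moments package computes.

```r
# Moment-based sample skewness: m3 / m2^(3/2), with central moments
# computed using n (not n - 1) in the denominator.
sample_skewness <- function(v) {
  v <- v[!is.na(v)]
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^(3 / 2)
}

# Hypothetical vectors, invented for illustration only.
left_tailed <- c(1, 6, 7, 7, 8, 8, 9)
symmetric   <- c(1, 2, 3, 4, 5)

skew_left <- sample_skewness(left_tailed)
skew_sym  <- sample_skewness(symmetric)

# For the left-tailed vector the mean sits below the median,
# and the skewness is negative.
mean(left_tailed) < median(left_tailed)
skew_left

# For the symmetric vector the skewness is zero.
skew_sym
```

A negative value indicates a longer tail below the center, a positive value a longer tail above it.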

Outliers are points that lie unusually far from the rest of the data. They can be identified
with a boxplot, which shows the minimum, first quartile, median, third quartile, and maximum
of the data and draws any point beyond the whiskers separately. In the boxplot of age below,
no points fall outside the whiskers, so age has no outliers.
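
The rule a boxplot uses to flag outliers can be reproduced by hand. The sketch below is an illustration under stated assumptions: the vector `v` is hypothetical, and the rule shown is the default of base R's `boxplot.stats()`, which flags points more than 1.5 times the hinge spread beyond the lower or upper hinge.

```r
# Hypothetical values, invented for illustration only.
v <- c(2, 4, 5, 5, 6, 6, 7, 8, 20)

# boxplot.stats() returns the statistics that boxplot() draws.
stats  <- boxplot.stats(v)
hinges <- stats$stats[c(2, 4)]   # lower and upper hinge
spread <- diff(hinges)           # hinge spread (close to the IQR)

# Fences at 1.5 times the spread beyond each hinge.
fence_low  <- hinges[1] - 1.5 * spread
fence_high <- hinges[2] + 1.5 * spread

# Points outside the fences are the outliers.
outliers_manual <- v[v < fence_low | v > fence_high]
outliers_manual
stats$out
```

An empty `$out` therefore means that every point lies within the fences, which is the situation described for age.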

The outliers of fork length (fl) are calculated below. The screenshot shows that fork length
does have outliers, meaning that some points fall beyond the whiskers of the boxplot. These
are read against the ranges shown in the figure below, 500 for fork length and 15 for age.

B) The data visualizations comprise the scatter plot, the histogram, the regression line, the
regression line with a legend, and the boxplot. The scatter plot shows an upward trend: as
fork length increases, age also increases, and vice versa. The highest point lies at about
(450,14). The histogram of age peaks at (7,4), and it has the same peak for two bins at 11.
The red and gray plot shows the same result as the scatter plot, namely a direct relation
between age and fork length. In the regression, the slope of the line is the key quantity
because it tells how strongly one variable changes with the other. The regression line passes
through a peak point at (450,14). The legend identifies which symbol and color belong to each
era, which makes the graph easier to read. The boxplot of fork length shows the outliers,
that is, the points that lie outside the whiskers.
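
The role of the slope can be shown with a small fit. The sketch below is a minimal illustration, assuming a hypothetical data frame `toy` whose values are invented and not taken from BullTroutRML2; it fits age against fork length with `lm()` and reads the slope and intercept from the fitted coefficients.

```r
# Hypothetical toy data, invented for illustration only.
toy <- data.frame(
  fl  = c(100, 200, 300, 400, 500),
  age = c(2, 5, 5, 9, 10)
)

# Least-squares line: age = intercept + slope * fl
fit <- lm(age ~ fl, data = toy)

slope     <- unname(coef(fit)["fl"])
intercept <- unname(coef(fit)["(Intercept)"])

# The slope is the expected change in age (years) per 1 mm of fork length.
slope

# Expected age difference between two fish that differ by 50 mm.
slope * 50
```

A positive slope corresponds to the upward trend seen in the scatter plot; `abline(fit)` would draw this fitted line over an existing plot.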

C) Looking across all the graphs and plots, we conclude that the minimum, the maximum, and
the highest peak occur at the same place with the same magnitude in every one of them. The
boxplot of age shows no outliers, since all points lie within the range of age, unlike fork
length, which has two outliers.


Appendix:

#ALY6000: Executive Summary Report 2


#Dataset Instructions
#Import libraries including: FSA, FSAdata, magrittr, dplyr, plotrix, ggplot2, and moments
#Install the packages once; skip this call if they are already installed
install.packages(c("FSA", "FSAdata", "magrittr", "dplyr", "plotrix",
"ggplot2", "readr", "moments", "car"))
#Load each package once
library(FSA)
library(FSAdata)
library(magrittr)
library(dplyr)
library(plotrix)
library(ggplot2)
library(moments)
library(readr)
library(car)

#Print your name at the top of the script. Include the prefix “Plotting Basics:” such that it
#appears as “Plotting Basics: Lastname”
print("Plotting Basics: Vaishnavi Shamsundar")

#Load the BullTroutRML2 dataset (BullTroutRML2.csv)


data("BullTroutRML2")
x<-BullTroutRML2
x

#Computing the summary statistics of the data set BullTroutRML2



summary(x)
sd(x$age)
sd(x$fl)
var(x$age)
var(x$fl)
skewness(x$age)
skewness(x$fl)
kurtosis(x$age)
kurtosis(x$fl)
boxplot(x$age)$out
boxplot(x$fl)$out

setwd("C:/Users/Vaishu/Desktop")

#Print the first and last 3 records from the BullTroutRML2 dataset
head(BullTroutRML2,3)
tail(BullTroutRML2,3)

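#Remove all records except those from Harrison Lake (hint: use the filterD() function)
#filterD() also drops unused factor levels; if it is not available in the installed FSA
#version, droplevels(dplyr::filter(BullTroutRML2, lake == "Harrison")) should give the same result
k <- filterD(BullTroutRML2, lake == "Harrison")
k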

#Display the first and last 5 records from the filtered BullTroutRML2 dataset
head(k,5)
tail(k,5)

#Display the structure of the filtered BullTroutRML2dataset


str(k)

#Display the summary of the filtered BullTroutRML2dataset


summary(k)

#Create a scatterplot for “age” (y variable) and “fl” (x variable)
#pch=16 draws a filled circle
plot(x= k$fl, y=k$age,
xlab = "Fork Length (mm)",
ylab = "Age (yrs)",
xlim = c(0,500),
ylim = c(0,15),
main = "Plot 1: Harrison Lake Trout", pch=16)

#Plot an “Age” histogram with the following specifications
#col sets the bar color and col.main sets the title color
hist(k$age,
main="Plot 2: Harrison Fish Age Distribution",
xlab="Age (yrs)",
ylab="Frequency",
border="blue",
col="cadetblue",
col.main="cadetblue",
xlim=c(0,15),
ylim=c(0,15))

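#Create an overdense plot using the same specifications as the previous scatterplot
#smoothScatter() shades by point density; pch applies only to the isolated points it draws
#pch=16 is a solid circle (17 would be a solid triangle)
smoothScatter(k$fl,k$age,
xlim=c(0,500),ylim=c(0,15),
xlab="Fork Length (mm)", ylab="Age (yrs)",
main="Plot 3: Harrison Density Shaded by Era",
pch= 16)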

#Create a new object called “tmp” that includes the first 3 and last 3 records of the
# data set.
a<-head(k,3)
b<-tail(k,3)
temp<-rbind(a,b)
temp

#Display the “era” column (variable) in the new “tmp” object


temp[, c("era"), drop=FALSE]

#Create a pchs vector with the argument values for + and x


pchs<- c("+","x")

#Create a cols vector with the two elements “red” and “gray60”
cols<- c("red", "gray60")

#Convert the tmp era values to numeric values


temp_era<-as.numeric(temp$era)
temp_era

#Initialize the cols vector with the tmp era values
#Indexing cols by the numeric era codes gives one color per record
cols[temp_era]

#Create a plot of “Age (yrs)” (y variable) versus “Fork Length (mm)” (x variable)
#Fork length is on the x axis and age on the y axis, so the labels follow that order
plot(x= k$fl, y = k$age, xlim = c(0,500), ylim = c(0,15), pch= ifelse(
k$era == "1977-80", pchs[1], pchs[2]),
col= ifelse(k$era == "1977-80", cols[1], cols[2]),
xlab = "Fork Length (mm)", ylab = "Age (yrs)",
main = "Plot 4: Symbol & Color by Era")

#Plot a regression line overlay on Plot 4 and title the new graph
plot(x= k$fl, y = k$age, xlim = c(0,500), ylim = c(0,15), pch= ifelse(
k$era == "1977-80", pchs[1], pchs[2]),
col= ifelse(k$era == "1977-80", cols[1], cols[2]),
xlab = "Fork Length (mm)", ylab = "Age (yrs)",
main = "Plot 5: Regression Overlay")
#Fit age on fork length and draw the fitted line
abline(lm(age~fl, data=k),col="blue")

#Place a legend on Plot 5 and call the new graph “Plot 6: Legend Overlay”
plot(x= k$fl, y = k$age, xlim = c(0,500), ylim = c(0,15), pch= ifelse(
k$era == "1977-80", pchs[1], pchs[2]),
col= ifelse(k$era == "1977-80", cols[1], cols[2]),
xlab = "Fork Length (mm)", ylab = "Age (yrs)",
main = "Plot 6: Legend Overlay")
abline(lm(age~fl, data=k),col="blue")
#Reuse pchs and cols so the legend matches the plotted symbols and colors
legend("topleft", legend=c("1977-80","1997-01"),pch=pchs,col=cols)
