Case Study
Aim:
Identify a diamond priced at around $500 with the best attributes for the
variables carat, cut, color and clarity.
Dataset:
The diamonds data set from the ggplot2 package is used.
Inference:
Code:
Output:
Data sets in package ‘ggplot2’:
Insight:
Two functions are used in this step:
library()
data()
library(): this function is used to load a package.
Argument passed:
“package”: the name of a package, given as a
name or a literal character string.
The package name passed in this command was
ggplot2.
data(): this function was used to view the list of available data sets.
Argument passed:
“package”: a character vector giving the
package(s) to look in for data sets.
The value passed in this command was
'ggplot2', which is a package name.
Executing the data(package='ggplot2') command
opens a new tab.
The new tab contains the list of all the data sets available in the
given package, that is 'ggplot2'.
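The code for this step is not shown above. A minimal sketch of the two calls described, assuming ggplot2 is installed, could look like this. The variable name `available` is only illustrative.

```r
library(ggplot2)  # attach the package that ships the diamonds data set

# data(package = ...) returns an object whose $results matrix lists
# the data sets bundled with the package.
available <- data(package = "ggplot2")
print(available$results[, c("Item", "Title")])
```

Printing `available` itself gives the formatted listing headed "Data sets in package ‘ggplot2’:".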
Code:
Output:
output for str(diamonds):
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 53940 obs. of 10 variables:
$ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
$ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
$ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
$ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
$ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
$ table : num 55 61 65 58 58 57 57 55 61 61 ...
$ price : int 326 326 327 334 335 336 336 337 337 338 ...
$ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
$ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
$ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
Insight:
Two functions are used in this step:
str()
summary()
str(): compactly displays the internal structure of an R object.
Argument passed:
“object”: any R object about which you want to
have some information.
The object passed in this command was diamonds (which
is a data set).
It shows the class, the dimensions, and the type and first
values of each column of the diamonds data set.
summary(): a generic function that produces a summary of an object; the
output depends on the class of its argument.
Argument passed:
“object”: an object for which a summary is
desired.
The object passed in this command was diamonds (which
is a data set).
For numeric columns it shows the minimum, quartiles, median,
mean and maximum; for factor columns it shows the counts per level.
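The inspection step described above can be sketched as follows. The dim() call is an addition for illustration and is not part of the original step.

```r
library(ggplot2)  # provides the diamonds data set

str(diamonds)      # class, dimensions, and type/first values of each column
summary(diamonds)  # per-column summaries (numeric: quartiles and mean; factor: counts)
dim(diamonds)      # number of rows and columns
```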
Code:
Output:
Insight:
CONCLUSION:
1. The histogram shows a long-tailed distribution, with a high
concentration of observations below the 5,000 mark.
2. The x axis shows the price of the diamonds and the y axis shows the
corresponding count of diamonds.
3. aes() is the aesthetics function in ggplot2. In
ggplot(data=diamonds, aes(x=price)), data=diamonds supplies the data set
and aes(x=price) maps price to the x axis; layers such as geoms are then
added on top with +.
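The plotting code for this histogram is not shown. A sketch consistent with the description might be the following. The `binwidth` value and the name `p_price` are assumptions made for illustration.

```r
library(ggplot2)

# ggplot() declares the data and the aesthetic mapping;
# geom_histogram() is the layer added on top with +.
# binwidth = 500 is an assumed value, not taken from the source.
p_price <- ggplot(data = diamonds, aes(x = price)) +
  geom_histogram(binwidth = 500)

# Call print(p_price) to render the plot.
```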
Code:
Output:
Insight:
Code:
Output:
> library(dplyr)
> diamonds %>% filter(price<1000) %>% as.data.frame() -> diamonds1000
> dim(diamonds1000)
[1] 14499 10
Insight:
library(): this function is used to load a package.
Argument passed:
“package”: the name of a package, given as a
name or a literal character string.
The package name passed in this command was
dplyr.
dplyr: a package for common data manipulation
tasks in R such as sorting, summarizing, and
selecting and creating variables.
#diamonds %>% filter(price<1000) %>% as.data.frame() -> diamonds1000
The diamonds data set is passed to the filter() function.
filter() keeps the rows/cases for which the conditions are true.
The condition passed to filter() is that the price of the diamond
is less than 1000, so the function returns the rows with a price
below 1000.
The output is then converted to a data frame and assigned to diamonds1000.
# dim(diamonds1000) is used to find the dimensions of diamonds1000
CONCLUSION
o A new data set, diamonds1000, has been formed containing the diamonds
priced below 1000.
o 14499 of the 53940 diamonds have a price below 1000.
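For reference, the same price filter can be written with the more conventional left-hand assignment. This is only a stylistic variant of the command above, with a range check added for illustration.

```r
library(ggplot2)  # provides diamonds
library(dplyr)    # provides filter() and %>%

# Same filter as in the case study, assigned with <- instead of ->.
diamonds1000 <- diamonds %>%
  filter(price < 1000) %>%
  as.data.frame()

dim(diamonds1000)          # the case study reports 14499 rows and 10 columns
range(diamonds1000$price)  # every remaining price should be below 1000
```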
Code:
Output:
Insight:
Code:
Output:
[1] 14439 10
Insight:
#diamonds1000 %>% filter(carat<0.512) %>% as.data.frame()-> diamonds1000512
The diamonds1000 data set is passed to the filter() function.
filter() keeps the rows/cases for which the conditions are true.
The condition passed to filter() is that the carat of the diamond
is less than 0.512, so the function returns the rows with a carat
below 0.512.
The output is then converted to a data frame and assigned to diamonds1000512.
# dim(diamonds1000512) is used to find the dimensions of diamonds1000512
CONCLUSION
o A new data set, diamonds1000512, has been formed containing the diamonds
with a carat below 0.512.
o 14439 of the 14499 diamonds have a carat below 0.512.
Code:
Output:
Insight:
Code:
Output:
Insight:
Code:
Output:
[1] 6821 10
Insight:
CONCLUSION
o A new data set, diamonds1000512I, has been formed containing the diamonds
whose cut is equal to “Ideal”.
o 6821 of the 14439 diamonds have a cut equal to “Ideal”.
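The code for the cut filter is not shown in this step. By analogy with the other filters, it presumably has the form below. This is an assumed reconstruction, and the intermediate set is rebuilt from the earlier conditions so that the example stands alone.

```r
library(ggplot2)
library(dplyr)

# Rebuild the intermediate set from the earlier price and carat steps.
diamonds1000512 <- diamonds %>%
  filter(price < 1000, carat < 0.512) %>%
  as.data.frame()

# Assumed form of the cut filter; level names are case-sensitive ("Ideal").
diamonds1000512I <- diamonds1000512 %>%
  filter(cut == "Ideal") %>%
  as.data.frame()

table(diamonds1000512$cut)  # counts per cut level before the filter
dim(diamonds1000512I)       # dimensions after the filter
```

Comparing against "ideal" or "IDEAL" would return zero rows, because the factor level is spelled "Ideal".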
Code:
Output:
Insight:
Code:
Output:
[1] 959 10
Insight:
diamonds1000512I %>% filter(color == 'D') %>% as.data.frame() ->
diamonds1000512ID
The diamonds1000512I data set is passed to the filter() function.
filter() keeps the rows/cases for which the conditions are true.
The condition passed to filter() is that the color of the
diamond is equal to “D”, so the function returns the rows
with color equal to “D”.
The output is then converted to a data frame and assigned to diamonds1000512ID.
# dim(diamonds1000512ID) is used to find the dimensions of diamonds1000512ID
CONCLUSION
o A new data set, diamonds1000512ID, has been formed containing the
diamonds with color equal to “D”.
o 959 of the 6821 diamonds have a color equal to “D”.
Code:
Output:
Insight:
Code:
Output:
Insight:
ggplot(): initializes a ggplot object. It can be used to declare the input data
frame for a graphic and to specify the set of plot aesthetics intended to be
common throughout all subsequent layers unless specifically overridden.
Argument passed:
data: the data set to use for the plot.
Here the data set passed is diamonds1000512ID.
aes(): describes how variables in the data are mapped to visual
properties (aesthetics) of geoms.
Arguments passed to the aesthetic function (aes()) were:
x: assigned the values of the clarity of the
diamonds
y: assigned the values of the price of the diamonds
geom_boxplot(): draws a box plot for one continuous
and one discrete variable.
fill=rainbow(8): argument passed to
geom_boxplot() to fill the boxes with eight
rainbow colors
CONCLUSION
o The result is a box plot with clarity on the x axis and price on the y axis.
o The boxes are filled with rainbow colors.
o The box plot is drawn to see what share of the diamonds of a given
clarity falls below a given price.
o For example, 50% of the diamonds with VVS1 clarity have a price below
900.
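The plotting code is not shown. A sketch of a comparable box plot follows, with one deliberate difference: the case study passes fill=rainbow(8) directly to geom_boxplot(), whereas this variant maps fill to clarity and supplies the rainbow palette through scale_fill_manual(). This keeps working even if some clarity levels are absent from the subset. The name `p_box` and the median table are additions for illustration.

```r
library(ggplot2)
library(dplyr)

# Rebuild the subset from the earlier steps so the example stands alone.
diamonds1000512ID <- diamonds %>%
  filter(price < 1000, carat < 0.512, cut == "Ideal", color == "D") %>%
  as.data.frame()

# Mapping fill to clarity gives each box its own colour.
p_box <- ggplot(data = diamonds1000512ID,
                aes(x = clarity, y = price, fill = clarity)) +
  geom_boxplot() +
  scale_fill_manual(values = rainbow(8))

# Median price per clarity level: the middle line of each box.
aggregate(price ~ clarity, data = diamonds1000512ID, FUN = median)

# Call print(p_box) to render the plot.
```

The median table gives a numeric way to check statements such as "50% of the VVS1 diamonds are priced below a given value".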
Code:
Output:
[1] 21 10
Insight:
diamonds1000512ID %>% filter(clarity == 'VVS1') %>% as.data.frame() ->
diamonds1000512IDVVS1
The diamonds1000512ID data set is passed to the filter() function.
filter() keeps the rows/cases for which the conditions are true.
The condition passed to filter() is that the clarity of the
diamond is equal to “VVS1”, so the function returns the rows
with clarity equal to “VVS1”.
The output is then converted to a data frame and assigned to
diamonds1000512IDVVS1.
# dim(diamonds1000512IDVVS1) is used to find the dimensions of
diamonds1000512IDVVS1
CONCLUSION
o A new data set, diamonds1000512IDVVS1, has been formed containing the
diamonds with clarity equal to “VVS1”.
o 21 of the 959 diamonds have a clarity equal to “VVS1”.
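The successive filters used so far can also be collapsed into a single filter() call, since comma-separated conditions are combined with AND. The sketch below compares the two forms. The names `stepwise` and `combined` are illustrative and do not appear in the case study.

```r
library(ggplot2)
library(dplyr)

# Step-by-step filtering, as in the case study.
stepwise <- diamonds %>%
  filter(price < 1000) %>%
  filter(carat < 0.512) %>%
  filter(cut == "Ideal") %>%
  filter(color == "D") %>%
  filter(clarity == "VVS1") %>%
  as.data.frame()

# Single call: conditions separated by commas must all be true.
combined <- diamonds %>%
  filter(price < 1000, carat < 0.512, cut == "Ideal",
         color == "D", clarity == "VVS1") %>%
  as.data.frame()

all.equal(stepwise, combined)  # checks that both approaches select the same rows
dim(combined)                  # the case study reports 21 rows and 10 columns
```

Keeping the intermediate data sets, as the case study does, is still useful when each stage needs to be inspected or plotted separately.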
Code:
Output:
Insight:
ggplot(): initializes a ggplot object. It can be used to declare the input data
frame for a graphic and to specify the set of plot aesthetics intended to be
common throughout all subsequent layers unless specifically overridden.
Argument passed:
data: the data set to use for the plot.
Here the data set passed is diamonds1000512IDVVS1.
aes(): describes how variables in the data are mapped to visual
properties (aesthetics) of geoms.
Argument passed to the aesthetic function (aes()) was:
x: aes() takes a list of name-value pairs giving aesthetics to
map to variables.
In this case x is price.
The plot in this case was a histogram:
The function used for drawing the histogram was
geom_histogram().
A histogram (geom_histogram()) divides the x axis into bins and
displays the count in each bin with bars, which is why it is used here.
A histogram is for a single continuous variable, i.e.
price.
CONCLUSION:
o The resulting plot is a histogram with price on the x axis and count on
the y axis.
o The histogram shows the price distribution of the final set.
Code:
Output:
carat cut color clarity depth table price x y z
1 0.27 Ideal D VVS1 61.7 57 586 4.16 4.2 2.58
Insight:
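The code that produced the single row above is not shown, so the selection rule is unknown. One possible rule, given the aim of finding a diamond priced around 500, is to take the row of the final set whose price is nearest to that target. The sketch below is hypothetical and may not reproduce the row shown. The names `final_set`, `target_price` and `closest` are illustrative.

```r
library(ggplot2)
library(dplyr)

# Rebuild the final set from the filters described in the case study.
final_set <- diamonds %>%
  filter(price < 1000, carat < 0.512, cut == "Ideal",
         color == "D", clarity == "VVS1") %>%
  as.data.frame()

# Hypothetical selection rule: the row whose price is nearest to the target.
target_price <- 500
closest <- final_set %>%
  slice_min(abs(price - target_price), n = 1, with_ties = FALSE)

closest
```

with_ties = FALSE guarantees a single row even when several diamonds are equally close to the target price.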