
Case Study

1) The document aims to identify a diamond around $500 with the best attributes for carat, cut, color, and clarity using the diamonds data set from the ggplot2 package. 2) Various functions are used to analyze and visualize the diamonds data, including filtering the data to diamonds under $1000, plotting a histogram of price, and summarizing key diamond attributes. 3) Key insights include identifying that the price distribution is long-tailed below $5,000 and filtering the data revealed over 14,000 diamonds priced under $1000.

Uploaded by

Shivani Sharma

Case Study

Aim:
Identify a diamond priced around $500 with the best attributes for the variables carat,
cut, color and clarity.

Dataset:
The diamonds data set from the ggplot2 package is used.

Inference:
Code:

Output:
Data sets in package ‘ggplot2’:

diamonds          Prices of 50,000 round cut diamonds
economics         US economic time series
economics_long    US economic time series
faithfuld         2d density estimate of Old Faithful data
luv_colours       'colors()' in Luv space
midwest           Midwest demographics
mpg               Fuel economy data from 1999 and 2008 for 38
                  popular models of car
msleep            An updated and expanded version of the mammals
                  sleep dataset
presidential      Terms of 11 presidents from Eisenhower to Obama
seals             Vector field of seal movements
txhousing         Housing sales in TX

Insight:
• Two functions are used in this step:
   o library()
   o data()
• library(): this function is used to load a package.
   o Argument passed:
      “package”: the name of a package, given as a name or a literal character string.
   o The package name passed in this command was ggplot2.

• data(): this function was used to view the list of available data sets.
   o Argument passed:
      “package”: a character vector giving the package(s) to look in for data sets.
   o The value passed in this command was 'ggplot2', which is a package name.
   o A new tab opens (in RStudio) after executing the data(package='ggplot2') command.
   o The new tab contains the list of all the data sets available in the given package, 'ggplot2'.
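
The Code block for this step did not survive in the text. A minimal sketch of the two calls the insight describes is given below; the variable name `available` is an assumption added here so that the listing can also be inspected as an object instead of only in a viewer tab.

```r
# Load the package that ships the diamonds data set
library(ggplot2)

# List the data sets bundled with ggplot2.
# data() returns an object whose $results matrix has the
# columns "Package", "LibPath", "Item" and "Title".
available <- data(package = "ggplot2")

# Show only the data set names and their descriptions
available$results[, c("Item", "Title")]
```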

Code:

Output:
output for str(diamonds):
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 53940 obs. of 10 variables:
$ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
$ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
$ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
$ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
$ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
$ table : num 55 61 65 58 58 57 57 55 61 61 ...
$ price : int 326 326 327 334 335 336 336 337 337 338 ...
$ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
$ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
$ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

Output for summary(diamonds):

     carat               cut        color        clarity          depth
 Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00
 1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00
 Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80
 Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75
 3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50
 Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00
                                    J: 2808   (Other): 2531

     table           price             x                y                z
 Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000
 1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720   1st Qu.: 2.910
 Median :57.00   Median : 2401   Median : 5.700   Median : 5.710   Median : 3.530
 Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735   Mean   : 3.539
 3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540   3rd Qu.: 4.040
 Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900   Max.   :31.800

Insight:
• Two functions are used in this step:
   o str()
   o summary()
• str(): compactly displays the internal structure of an R object.
   o Argument passed:
      “object”: any R object about which you want to have some information.
   o The object passed in this command was diamonds (which is a data set).
   o It shows that diamonds has 53940 observations of 10 variables, with the type and the first few values of each variable.
• summary(): a generic function used to produce summaries of an object; for a data frame it summarizes each column.
   o Argument passed:
      “object”: an object for which a summary is desired.
   o The object passed in this command was diamonds (which is a data set).
   o It shows a summary of the diamonds data set: minimum, quartiles, median, mean and maximum for each numeric variable, and the count per level for each factor.
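
The Code block for this step is missing from the text. Going by the output headings above, the two calls were presumably the following (a sketch, not a recovered listing):

```r
library(ggplot2)

# Structure: number of observations, variables and their types
str(diamonds)

# Per-column summary: quartiles for numeric columns,
# level counts for the ordered factors cut, color and clarity
summary(diamonds)
```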

Code:

Output:

Insight:

• ggplot(): initializes a ggplot object. It can be used to declare the input data frame for a graphic and to specify the set of plot aesthetics intended to be common throughout all subsequent layers unless specifically overridden.
   o Argument passed:
      data: the data set to use for the plot. Here the data set passed is diamonds.
• aes(): describes how variables in the data are mapped to visual properties (aesthetics) of geoms.
   o The arguments passed to the aesthetic function aes() are name-value pairs giving the aesthetics to map to variables. In this case x is price.
• The plot drawn in this case was a histogram:
   o The function used for drawing the histogram was geom_histogram().
   o Histograms (geom_histogram()) display the counts with bars.
   o A histogram is for a single continuous variable, i.e. price.

• CONCLUSION:
   1. This is a long-tailed (right-skewed) distribution, with a high concentration of observations below the 5,000 mark.
   2. The x axis shows the price of the diamonds and the y axis shows the corresponding count of diamonds.
   3. aes() is the aesthetics function in ggplot2. data=diamonds and aes(x=price) are arguments to ggplot(); layers such as geom_histogram() are then added on top with +.
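
The code for this plot is not present in the text. A minimal sketch consistent with the insight above would be the following; the object name `p_price` is an assumption, and the default of 30 bins is left in place because the text does not mention a bin setting.

```r
library(ggplot2)

# data and aes() go to ggplot(); the histogram is a layer added with +
p_price <- ggplot(data = diamonds, aes(x = price)) +
  geom_histogram()

# Typing the object name at top level draws the plot
p_price
```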

Code:

Output:
Insight:

• ggplot(): initializes a ggplot object. It can be used to declare the input data frame for a graphic and to specify the set of plot aesthetics intended to be common throughout all subsequent layers unless specifically overridden.
   o Argument passed:
      data: the data set to use for the plot. Here the data set passed is diamonds.
• aes(): describes how variables in the data are mapped to visual properties (aesthetics) of geoms.
   o The arguments passed to the aesthetic function aes() are name-value pairs giving the aesthetics to map to variables. In this case x is price.
• The plot drawn in this case was a histogram:
   o The function used for drawing the histogram was geom_histogram().
   o Histograms (geom_histogram()) display the counts with bars, which is why a histogram has been used.
   o A histogram is for a single continuous variable, i.e. price.
   o The argument passed to geom_histogram() was breaks.
      breaks: a numeric vector giving the bin boundaries. Overrides binwidth, bins, center, and boundary.
      A sequence was assigned to breaks that starts at 0 and runs to 2000 in increments of 500.
• CONCLUSION:
   The histogram has been limited on the x axis: price starts at 0 and ends at 2000, in bins of width 500.
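
The code for this plot is also missing. The description of breaks (a sequence from 0 to 2000 in steps of 500) suggests something like the sketch below; the use of seq() and the object name `p_low` are assumptions. Prices above 2000 fall outside the bins, so ggplot2 is expected to warn about removed rows when the plot is drawn.

```r
library(ggplot2)

# Bin boundaries at 0, 500, 1000, 1500 and 2000
price_breaks <- seq(0, 2000, by = 500)

p_low <- ggplot(data = diamonds, aes(x = price)) +
  geom_histogram(breaks = price_breaks)

p_low
```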

Code:

Output:
library(dplyr)
> diamonds %>% filter(price<1000) %>% as.data.frame() -> diamonds1000
> dim(diamonds1000)
[1] 14499 10

Insight:
• library(): this function is used to load a package.
   o Argument passed:
      “package”: the name of a package, given as a name or a literal character string.
   o The package name passed in this command was dplyr.
   o dplyr: a powerful package that performs common data manipulation tasks in R such as sorting, summarizing, and variable selection and creation, among others.
• diamonds %>% filter(price<1000) %>% as.data.frame() -> diamonds1000
   o The diamonds data set is passed to the filter() function.
   o filter() finds the rows (cases) where the conditions are true.
   o The condition passed to filter() is that the price of the diamond is less than 1000, so only rows with price below 1000 are kept.
   o The output is then converted to a data frame and assigned to diamonds1000.
• dim(diamonds1000) is used to find the dimensions of diamonds1000.
• CONCLUSION
   o A new data set, diamonds1000, has been formed containing the diamonds priced below 1000.
   o 14499 of the 53940 diamonds have a price below 1000.

Code:

Output:

Insight:

• ggplot(): initializes a ggplot object. It can be used to declare the input data frame for a graphic and to specify the set of plot aesthetics intended to be common throughout all subsequent layers unless specifically overridden.
   o Argument passed:
      data: the data set to use for the plot. Here the data set passed is diamonds1000.
• aes(): describes how variables in the data are mapped to visual properties (aesthetics) of geoms.
   o x is assigned the value returned by the cut() function.
   o cut():
      the arguments passed are carat and breaks;
      carat is converted to a categorical variable by cutting it into 10 intervals.
   o y is assigned the price of the diamonds.
• geom_boxplot(): function to draw a box plot. A box plot is used to see, for a given carat interval, the price below which 25%, 50% and 75% of the diamonds fall (for example 400, 600, etc.).
• CONCLUSION: the result is a box plot with carat on the x axis and price on the y axis, showing the price quartiles of the diamonds in each carat interval.
   o Also, as carat increases, the number of diamonds decreases and the price increases.
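
The code for this box plot is missing from the text. A sketch of what the insight describes is shown below, assuming that "10 intervals" means breaks = 10 (ten equal-width intervals); the object name `p_carat` is hypothetical. Inside aes(), the call cut(carat, ...) resolves to the base R function even though the data also has a column named cut, because R skips non-function objects when looking up a function call.

```r
library(ggplot2)
library(dplyr)

# Rebuild the under-1000 subset so the example is self-contained
diamonds1000 <- diamonds %>%
  filter(price < 1000) %>%
  as.data.frame()

# cut() turns the continuous carat into a 10-level factor,
# giving one box per carat interval
p_carat <- ggplot(diamonds1000, aes(x = cut(carat, breaks = 10), y = price)) +
  geom_boxplot()

p_carat

# Number of diamonds per carat interval (not visible in the box plot itself)
table(cut(diamonds1000$carat, breaks = 10))
```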

Code:

Output:
[1] 14439 10

Insight:
• diamonds1000 %>% filter(carat<0.512) %>% as.data.frame() -> diamonds1000512
   o The diamonds1000 data set is passed to the filter() function.
   o filter() finds the rows (cases) where the conditions are true.
   o The condition passed to filter() is that the carat of the diamond is less than 0.512, so only rows with carat below 0.512 are kept.
   o The output is then converted to a data frame.
• dim(diamonds1000512) is used to find the dimensions of diamonds1000512.

• CONCLUSION
   o A new data set, diamonds1000512, has been formed containing the diamonds with carat less than 0.512.
   o 14439 of the 14499 diamonds have a carat below 0.512.

Code:

Output:

Insight:

• ggplot(): initializes a ggplot object. It can be used to declare the input data frame for a graphic and to specify the set of plot aesthetics intended to be common throughout all subsequent layers unless specifically overridden.
   o Argument passed:
      data: the data set to use for the plot. Here the data set passed is diamonds.
• aes(): describes how variables in the data are mapped to visual properties (aesthetics) of geoms.
   o The arguments passed to the aesthetic function aes() were:
      x: assigned the values of carat
      y: assigned the prices of the diamonds
      color: colours the points based on cut
• geom_point(): draws scatter plots.
   o The scatter plot is most useful for displaying the relationship between two continuous variables.
• CONCLUSION
   o The result is a scatter plot with price on the y axis, carat on the x axis and points coloured by cut.
   o The plot shows the price distribution with respect to cut.
   o The different colours show the different grades of cut: Fair, Good, Very Good, etc.
   o Most of the diamonds are concentrated at the lower end of the carat axis; the summary output above gives a median of 0.70 carat and a third quartile of 1.04 carat.
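
The code for the scatter plot is not in the text. The three aesthetics listed in the insight map directly onto a call like the following sketch (the object name `p_scatter` is an assumption):

```r
library(ggplot2)

# One point per diamond: carat on x, price on y, colour by cut
p_scatter <- ggplot(data = diamonds, aes(x = carat, y = price, color = cut)) +
  geom_point()

p_scatter
```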

Code:

Output:
Insight:

• ggplot(): initializes a ggplot object. It can be used to declare the input data frame for a graphic and to specify the set of plot aesthetics intended to be common throughout all subsequent layers unless specifically overridden.
   o Argument passed:
      data: the data set to use for the plot. Here the data set passed is diamonds1000512.
• aes(): describes how variables in the data are mapped to visual properties (aesthetics) of geoms.
   o The arguments passed to the aesthetic function aes() were:
      x: assigned the cut of the diamonds
      y: assigned the price of the diamonds
• geom_boxplot(): function to draw a box plot for one continuous and one discrete variable.
• CONCLUSION
   o The result is a box plot with cut on the x axis and price on the y axis.
   o The box plot is drawn to see what percentage of diamonds of a given cut “c” are priced below a given price “p”.
   o For example: of the diamonds with Ideal cut, 25% have a price below 600 and 75% have a price below 900.
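
The plotting code is missing here as well. The sketch below rebuilds the subset and draws the box plot described in the insight, then computes the price quartiles per cut so the percentages read off the boxes can be checked numerically. The names `p_cut` and `price_quartiles` are assumptions, and the two filters are merged into one call for brevity.

```r
library(ggplot2)
library(dplyr)

# Diamonds under 1000 with carat below 0.512
diamonds1000512 <- diamonds %>%
  filter(price < 1000, carat < 0.512) %>%
  as.data.frame()

# One box per cut; the box edges mark the 25% and 75% price quantiles
p_cut <- ggplot(diamonds1000512, aes(x = cut, y = price)) +
  geom_boxplot()

p_cut

# The same quartiles as numbers
price_quartiles <- diamonds1000512 %>%
  group_by(cut) %>%
  summarise(
    q25 = quantile(price, 0.25),
    q50 = quantile(price, 0.50),
    q75 = quantile(price, 0.75)
  )

price_quartiles
```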

Code:

Output:
[1] 6821 10

Insight:

• diamonds1000512 %>% filter(cut == 'Ideal') %>% as.data.frame() -> diamonds1000512I
   o The diamonds1000512 data set is passed to the filter() function.
   o filter() finds the rows (cases) where the conditions are true.
   o The condition passed to filter() is that the cut of the diamond equals 'Ideal', so only rows with Ideal cut are kept.
   o The output is then converted to a data frame.
• dim(diamonds1000512I) is used to find the dimensions of diamonds1000512I.

• CONCLUSION
   o A new data set, diamonds1000512I, has been formed containing the diamonds with cut equal to 'Ideal'.
   o 6821 of the 14439 diamonds have cut equal to 'Ideal'.

Code:

Output:
Insight:

• ggplot(): initializes a ggplot object. It can be used to declare the input data frame for a graphic and to specify the set of plot aesthetics intended to be common throughout all subsequent layers unless specifically overridden.
   o Argument passed:
      data: the data set to use for the plot. Here the data set passed is diamonds1000512I.
• aes(): describes how variables in the data are mapped to visual properties (aesthetics) of geoms.
   o The arguments passed to the aesthetic function aes() were:
      x: assigned the colour of the diamonds
      y: assigned the price of the diamonds
• geom_boxplot(): function to draw a box plot for one continuous and one discrete variable.
• CONCLUSION
   o The result is a box plot with color on the x axis and price on the y axis.
   o The box plot is drawn to see what percentage of diamonds of a given color “c1” are priced below a given price “p”.
   o For example: of the diamonds with color I, 50% have a price below 600.
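
As the Code block is empty, here is a hedged sketch of the plot described above, together with the median price per color, which is the line drawn inside each box. The names `p_color` and `median_by_color` are assumptions, and the earlier filters are merged into a single call.

```r
library(ggplot2)
library(dplyr)

# Ideal-cut diamonds under 1000 with carat below 0.512
diamonds1000512I <- diamonds %>%
  filter(price < 1000, carat < 0.512, cut == "Ideal") %>%
  as.data.frame()

p_color <- ggplot(diamonds1000512I, aes(x = color, y = price)) +
  geom_boxplot()

p_color

# Median price for each color level (NA for a level with no rows)
median_by_color <- tapply(diamonds1000512I$price, diamonds1000512I$color, median)
median_by_color
```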

Code:

Output:
[1] 959 10

Insight:
• diamonds1000512I %>% filter(color == 'D') %>% as.data.frame() -> diamonds1000512ID
   o The diamonds1000512I data set is passed to the filter() function.
   o filter() finds the rows (cases) where the conditions are true.
   o The condition passed to filter() is that the color of the diamond equals 'D', so only rows with color D are kept.
   o The output is then converted to a data frame.
• dim(diamonds1000512ID) is used to find the dimensions of diamonds1000512ID.

• CONCLUSION
   o A new data set, diamonds1000512ID, has been formed containing the diamonds with color equal to 'D'.
   o 959 of the 6821 diamonds have color equal to 'D'.

Code:

Output:

Insight:

• ggplot(): initializes a ggplot object. It can be used to declare the input data frame for a graphic and to specify the set of plot aesthetics intended to be common throughout all subsequent layers unless specifically overridden.
   o Argument passed:
      data: the data set to use for the plot. Here the data set passed is diamonds1000512ID.
• aes(): describes how variables in the data are mapped to visual properties (aesthetics) of geoms.
   o The argument passed to the aesthetic function aes() was:
      x: assigned the clarity of the diamonds
• geom_bar(): function to draw a bar chart for one discrete variable.
• CONCLUSION
   o The result is a bar chart with clarity on the x axis and count on the y axis.
   o The bar chart is drawn to see the number of diamonds with each clarity.
   o The height of each bar is proportional to the number of cases with that clarity.
   o For example, 300 diamonds have clarity “SI1”.
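
The code for the bar chart is absent from the text. A sketch under the same assumptions as before (merged filters, hypothetical object names `p_clarity` and `clarity_counts`) follows; table() gives the counts that the bar heights represent.

```r
library(ggplot2)
library(dplyr)

# Ideal cut, color D, under 1000, carat below 0.512
diamonds1000512ID <- diamonds %>%
  filter(price < 1000, carat < 0.512, cut == "Ideal", color == "D") %>%
  as.data.frame()

# geom_bar() counts the rows in each clarity level
p_clarity <- ggplot(diamonds1000512ID, aes(x = clarity)) +
  geom_bar()

p_clarity

# The counts behind the bars
clarity_counts <- table(diamonds1000512ID$clarity)
clarity_counts
```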

Code:

Output:

Insight:
• ggplot(): initializes a ggplot object. It can be used to declare the input data frame for a graphic and to specify the set of plot aesthetics intended to be common throughout all subsequent layers unless specifically overridden.
   o Argument passed:
      data: the data set to use for the plot. Here the data set passed is diamonds1000512ID.
• aes(): describes how variables in the data are mapped to visual properties (aesthetics) of geoms.
   o The arguments passed to the aesthetic function aes() were:
      x: assigned the clarity of the diamonds
      y: assigned the price of the diamonds
• geom_boxplot(): function to draw a box plot for one continuous and one discrete variable.
   o fill=rainbow(8): argument passed to geom_boxplot() to fill the boxes with rainbow colours, one per clarity level.
• CONCLUSION
   o The result is a box plot with clarity on the x axis and price on the y axis.
   o The boxes are filled with rainbow colours.
   o The box plot is drawn to see what percentage of diamonds of a given clarity “c1” are priced below a given price “p”.
   o For example: of the diamonds with VVS1 clarity, 50% have a price below 900.
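
The code for this plot is missing. The insight says fill=rainbow(8) was passed straight to geom_boxplot(), which relies on the plot having exactly eight boxes. The sketch below is a variation, not the author's call: it maps fill to clarity and supplies the same eight rainbow colours through scale_fill_manual(), which also works if a clarity level happens to be absent from the subset. The object name `p_clarity_box` is an assumption.

```r
library(ggplot2)
library(dplyr)

diamonds1000512ID <- diamonds %>%
  filter(price < 1000, carat < 0.512, cut == "Ideal", color == "D") %>%
  as.data.frame()

# clarity has 8 levels, so 8 rainbow colours cover every possible box
p_clarity_box <- ggplot(diamonds1000512ID, aes(x = clarity, y = price, fill = clarity)) +
  geom_boxplot() +
  scale_fill_manual(values = rainbow(8))

p_clarity_box
```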

Code:

Output:
[1] 21 10

Insight:
• diamonds1000512ID %>% filter(clarity == 'VVS1') %>% as.data.frame() -> diamonds1000512IDVVS1
   o The diamonds1000512ID data set is passed to the filter() function.
   o filter() finds the rows (cases) where the conditions are true.
   o The condition passed to filter() is that the clarity of the diamond equals 'VVS1', so only rows with clarity VVS1 are kept.
   o The output is then converted to a data frame.
• dim(diamonds1000512IDVVS1) is used to find the dimensions of diamonds1000512IDVVS1.

• CONCLUSION
   o A new data set, diamonds1000512IDVVS1, has been formed containing the diamonds with clarity equal to 'VVS1'.
   o 21 of the 959 diamonds have clarity equal to 'VVS1'.

Code:

Output:

Insight:
• ggplot(): initializes a ggplot object. It can be used to declare the input data frame for a graphic and to specify the set of plot aesthetics intended to be common throughout all subsequent layers unless specifically overridden.
   o Argument passed:
      data: the data set to use for the plot. Here the data set passed is diamonds1000512IDVVS1.
• aes(): describes how variables in the data are mapped to visual properties (aesthetics) of geoms.
   o The arguments passed to the aesthetic function aes() are name-value pairs giving the aesthetics to map to variables. In this case x is price.
• The plot in this case was a histogram:
   o The function used for drawing the histogram was geom_histogram().
   o Histograms (geom_histogram()) display the counts with bars, which is why a histogram has been used.
   o A histogram is for a single continuous variable, i.e. price.
• CONCLUSION:
   o The resulting plot is a histogram with price on the x axis and count on the y axis.
   o The histogram shows the price distribution of the final set of 21 diamonds.

Code

Output:
carat cut color clarity depth table price x y z
1 0.27 Ideal D VVS1 61.7 57 586 4.16 4.2 2.58

Insight:

• diamonds1000512IDVVS1 %>% filter(price < 600) %>% as.data.frame() -> finalData
   o The diamonds1000512IDVVS1 data set is passed to the filter() function.
   o filter() finds the rows (cases) where the conditions are true.
   o The condition passed to filter() is that the price of the diamond is less than 600, so only rows with price below 600 are kept.
   o The output is then converted to a data frame.
• finalData gives the final output.
• CONCLUSION
   o At last we are left with one diamond, priced at 586, with:
      0.27 carat
      Ideal cut
      D as color
      VVS1 as clarity
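
The whole selection can also be written as one pipeline. This is a condensed sketch of the filter steps reported in the Insight sections, not the author's original script; the intermediate data sets (diamonds1000, diamonds1000512, and so on) are skipped, and the thresholds are the ones stated in the text.

```r
library(ggplot2)
library(dplyr)

# All conditions from the case study applied in a single filter() call;
# conditions separated by commas are combined with AND
finalData <- diamonds %>%
  filter(
    price < 1000,        # first cut on price
    carat < 0.512,       # carat threshold used in the text
    cut == "Ideal",      # best cut
    color == "D",        # best color
    clarity == "VVS1",   # clarity chosen in the text
    price < 600          # final cut on price
  ) %>%
  as.data.frame()

finalData
```

Because price < 600 already implies price < 1000, the first condition is redundant here; it is kept only to mirror the order of the steps in the case study.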
