
Lab 8

This summary analyzes a document describing exploratory data analysis tasks performed on a diamonds dataset using R and ggplot2. Key visualizations included bar plots examining cut and color distributions, histograms of carat weights overall and in a price-filtered subset, scatter plots of carat versus price and cut versus price relationships, and box plots of price distributions by cut. Calculations on a test vector with NA values demonstrated the impact of including or removing missing data. Overall, the exploratory analysis provided insights into diamond attribute distributions, correlations, and statistical summaries.

Uploaded by

Roaster Guru

#Lab - 8 Exploration

1. The first 6 rows of the diamonds data set and its structure are given below. Using this
data set, do the following tasks with the ggplot2 package:
diamond
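
The rows and structure mentioned in the task can be printed directly in R. This is a minimal sketch, assuming the ggplot2 package is installed; the diamonds data set ships with that package.

```r
library(ggplot2)  # provides the diamonds data set

# First six rows of the data
first_rows <- head(diamonds)
print(first_rows)

# Column names, types, and a preview of the values
str(diamonds)
```

The columns used in the tasks below are carat, cut, color, and price.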

a. Study the distribution of the quality of the cut (cut).


Code
library(ggplot2)
ggplot(diamonds, aes(x = cut)) + geom_bar()
Output
Using the ggplot2 package and the diamonds dataset, a bar plot was generated. This
plot showcased the frequency of each diamond cut category, offering a visual
representation of the distribution of cuts within the dataset.
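
The bar heights in this plot are simple counts per category, so they can also be read as numbers. The sketch below is an illustrative addition, not part of the original lab; the variable name cut_counts is an assumption.

```r
library(ggplot2)

# Number of diamonds in each cut category (the bar heights)
cut_counts <- table(diamonds$cut)
print(cut_counts)

# The same counts as proportions of the whole data set
print(round(prop.table(cut_counts), 3))
```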

b. Study the distribution of the weight of the diamond (carat).


Code
ggplot(diamonds,aes(x=carat))+geom_histogram(binwidth = 0.5)
Output

To visualize the distribution of diamond carat weights, a histogram was generated
using the geom_histogram function. A binwidth of 0.5 was specified to create the
plot, enabling us to examine the frequency of carat weights within distinct ranges.
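
The shape of a histogram depends on the bin width, so it can help to compare the plot above with a finer one. The sketch below uses binwidth = 0.1, which is an illustrative value chosen here and not part of the original lab.

```r
library(ggplot2)

# Same histogram with narrower bins; 0.1 is an assumed, illustrative value
p_fine <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.1)
print(p_fine)
```

Narrower bins show more detail but also more noise; the lab's value of 0.5 gives a coarser overview.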
c. Study the distribution of the weight of the diamond (carat) when the price (price)
is more than 6000$.
Code
ggplot(subset(diamonds, price > 6000), aes(x = carat)) + geom_histogram(binwidth = 0.5)

Output

To analyze only the diamonds priced above 6000, the subset() function was used to
filter the dataset on the condition price > 6000. The filtered data was then passed
to ggplot to create a histogram, which shows how carat weights are distributed
among these higher-priced diamonds.
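
Before plotting a filtered data set, it can be useful to check how many rows survive the filter. A minimal sketch, assuming ggplot2 is installed; the name expensive is hypothetical.

```r
library(ggplot2)

# Keep only the diamonds priced above 6000
expensive <- subset(diamonds, price > 6000)

# How many rows remain, compared with the full data set
cat("Rows kept:", nrow(expensive), "of", nrow(diamonds), "\n")

# Numeric summary of carat within the subset
print(summary(expensive$carat))
```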
d. Study the relationship between the diamond’s weight (carat) and its price (price).
Code
ggplot(diamonds,aes(x = carat,y = price))+geom_point()
Output

By creating a scatter plot, we aimed to investigate the correlation between diamond
carat weight and price. Each individual diamond in the dataset was represented by a
data point, and the position of each point on the plot was determined by its carat
weight and price values. This scatter plot offered a comprehensive visualization of the
relationship between carat weight and price, allowing for an examination of the
overall association between these two variables.
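
The association seen in the scatter plot can also be summarized by a single number. The sketch below computes the Pearson correlation coefficient with cor(); this is an addition to the lab, and it measures only the linear part of the relationship.

```r
library(ggplot2)

# Pearson correlation between carat weight and price
r <- cor(diamonds$carat, diamonds$price)
print(r)
```

A value close to 1 indicates a strong positive linear association; the scatter plot remains the better tool for spotting curvature or clusters.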
e. Study the relationship between the quality of the cut (cut) and the diamond color
(color).
Code
ggplot(diamonds,aes(x = cut,fill = color))+geom_bar()
Output

Using a bar plot, we visualized the distribution of diamond cuts based on their color.
Each cut category was represented by a separate bar, and within each bar, segments
were used to indicate different colors. This plot provided a clear visualization of the
combination of cut and color within the dataset, allowing for a better understanding of
how these attributes are distributed among the diamonds.
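
The segment sizes in the stacked bar plot correspond to a two-way table of counts. The sketch below prints that table and draws a proportion-scaled version of the plot with position = "fill"; both are illustrative additions, and the names cut_by_color and p_fill are assumptions.

```r
library(ggplot2)

# Counts for every combination of cut (rows) and color (columns)
cut_by_color <- table(diamonds$cut, diamonds$color)
print(cut_by_color)

# Same plot with each bar scaled to 1, so color shares are comparable across cuts
p_fill <- ggplot(diamonds, aes(x = cut, fill = color)) +
  geom_bar(position = "fill")
print(p_fill)
```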

f. Study the relationship between the quality of the cut (cut) and the price (price).
Code
ggplot(diamonds,aes(x = cut,y=price))+geom_boxplot()
Output
To examine the distribution of prices across various diamond cuts, a box plot was
created. For each cut category, the box shows the first quartile, median, and third
quartile of price. The whiskers extend to the most extreme values that lie within 1.5
times the interquartile range of the box, and points beyond the whiskers are drawn
individually as outliers. The plot therefore shows how the spread of prices varies
among different diamond cuts.
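
The quartiles drawn in the box plot can be computed directly. A minimal sketch using base R's split() and quantile(); the name price_quantiles is hypothetical.

```r
library(ggplot2)

# Minimum, quartiles, and maximum of price for each cut category
price_quantiles <- sapply(split(diamonds$price, diamonds$cut), quantile)
print(price_quantiles)
```

Each column is one cut; the rows labelled 25%, 50%, and 75% correspond to the bottom, middle line, and top of each box.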

2. Create a new vector with the following data: 1,2,3,4,NA,6,7,8,NA,NA. NA means ‘Not
Available’ / Missing Values. Use min, max, and mean functions to get the minimum,
maximum, and average, respectively for this vector. Try using the argument
na.rm=TRUE with these three functions and re-print the results.
Code
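vec <- c(1, 2, 3, 4, NA, 6, 7, 8, NA, NA)

# Without removing NA: each function returns NA
cat("Min:", min(vec), "\n")
cat("Max:", max(vec), "\n")
cat("Mean:", mean(vec), "\n")

# With NA removed: computed from the 7 available values
cat("Min:", min(vec, na.rm = TRUE), "\n")
cat("Max:", max(vec, na.rm = TRUE), "\n")
cat("Mean:", mean(vec, na.rm = TRUE), "\n")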
Output

Given a vector named vec, which includes numeric values as well as missing values represented
as NA, calculations were performed on this vector as follows:

a. Initially, without removing the NA values, the minimum, maximum, and mean of the
vector were computed using the min, max, and mean functions, respectively. Because the
vector contains NA values, all three calls return NA.

b. Next, the same calculations were repeated with the na.rm = TRUE argument, which drops
the NA values before computing. The results then reflect only the seven available numeric
values: a minimum of 1, a maximum of 8, and a mean of 31/7, approximately 4.43.
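
The effect of na.rm = TRUE can be reproduced by hand, which makes clear what the argument does. The sketch below is an illustrative addition; the names n_missing, observed, and manual_mean are assumptions.

```r
vec <- c(1, 2, 3, 4, NA, 6, 7, 8, NA, NA)

# Count the missing entries
n_missing <- sum(is.na(vec))
cat("Missing values:", n_missing, "\n")

# Drop the NA entries manually, then average what is left
observed <- vec[!is.na(vec)]
manual_mean <- sum(observed) / length(observed)
cat("Manual mean:", manual_mean, "\n")

# Same result as letting mean() drop the NA values itself
cat("mean(na.rm = TRUE):", mean(vec, na.rm = TRUE), "\n")
```

Three of the ten entries are missing, so the mean is taken over seven values whose sum is 31.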

Conclusion
In summary, this analysis encompassed the visualization of the diamonds dataset using different
plots, enabling us to gain valuable insights into the distribution of cuts, carat weights, prices, and
color combinations. Furthermore, calculations were conducted on a vector containing NA values,
with and without their removal, to compare statistical summaries. These exploratory tasks
facilitated a comprehensive examination and analysis of the dataset, revealing patterns,
relationships, and summary statistics pertaining to diamonds.
