
Mini Project - Cold Storage Case Study

The document describes exploring a cold storage temperature dataset using R. It outlines importing the data, identifying variables, performing univariate analysis to examine individual variables, bi-variate analysis to examine relationships between two variables, and finding there is one outlier in temperature. Temperature is recorded daily across months and seasons, with the mean temperature highest in summer and lowest in winter.

Mini Project – Cold Storage Case Study

Project Report
Table of Contents
1 Project Objective
2 Assumptions
3 Exploratory Data Analysis – Step by step approach
3.1 Environment Set up and Data Import
3.1.1 Install necessary Packages and Invoke Libraries
3.1.2 Set up working Directory
3.1.3 Import and Read the Dataset
3.2 Variable Identification
3.2.1 Variable Identification – Inferences
3.3 Univariate Analysis
3.4 Bi-Variate Analysis
3.5 Missing Value Identification
3.6 Outlier Identification
3.7 Variable Transformation / Feature Creation
4 Conclusion
5 Appendix A – Source Code
1 Project Objective
The objective of the report is to explore the cold storage data sets (“Cold_Storage_Mar2018” and
“Cold_Storage_Temp_Data (1)” ) in R and generate insights about the data set. This exploration
report will consist of the following:

• Importing the dataset in R


• Understanding the structure of dataset
• Graphical exploration
• Descriptive statistics
• Insights from the dataset

2 Assumptions
The assumptions are as follows:
1. The recorded temperatures are normally distributed.
2. The cold storage plant records the temperature every day, so no null values are
expected.
3. The temperature at the cold storage is always maintained at the optimum level to avoid
contamination of products.

3 Exploratory Data Analysis – Step by step approach


A typical data exploration activity consists of the following steps:

1. Environment Set up and Data Import


2. Variable Identification
3. Univariate Analysis
4. Bi-Variate Analysis
5. Missing Value Treatment (Not in scope for our project)
6. Outlier Treatment (Not in scope for our project)
7. Variable Transformation / Feature Creation
8. Feature Exploration

We shall follow these steps in exploring the provided dataset.

Although Steps 5 and 6 are not in scope for this project, a brief note on these steps (and the
other steps as well) is given, as they are important parts of the data exploration journey.

3.1 Environment Set up and Data Import

3.1.1 Install necessary Packages and Invoke Libraries


In addition to the base R packages, the following packages are installed and loaded:
install.packages("dplyr")
library(dplyr)
install.packages("ggplot2")
library(ggplot2)
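
Calling install.packages() on every run downloads packages again even when they are already present. A minimal sketch, assuming you would rather install only what is missing; missing_packages is a hypothetical helper name introduced here for illustration and is not part of the original code:

```r
# Hypothetical helper: return the subset of `pkgs` that is not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("dplyr", "ggplot2"))
if (length(to_install) > 0) {
  message("Packages to install: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to actually install
}
```

Once nothing is reported as missing, the library() calls shown above load the packages as usual.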

3.1.2 Set up working Directory
Setting a working directory at the start of the R session makes importing and exporting data
files and code files easier. The working directory is the folder on the PC that holds the
data, code, etc. related to the project.
Please refer to Appendix A for the source code.

3.1.3 Import and Read the Dataset


The given dataset is in .csv format, so the read.csv function is used to import the
file.

Please refer to Appendix A for the source code.
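
One caveat, offered as a general fact about R rather than something stated in this report: from R 4.0.0 onward, read.csv no longer converts text columns to factors by default, whereas the output in Appendix A (produced in an R 3.x session) shows Season and Month as factors. A minimal sketch, using a few made-up rows in place of the real file, of how the factor behaviour can be requested explicitly:

```r
# Toy CSV content standing in for the real file (illustrative rows only).
csv_text <- "Season,Month,Date,Temperature
Winter,Jan,1,2.4
Winter,Jan,2,2.3
Summer,Apr,1,3.1"

# stringsAsFactors = TRUE makes Season and Month factors, as in the report.
cold <- read.csv(text = csv_text, stringsAsFactors = TRUE)
str(cold)
```

With the real data, the same argument would be added to the read.csv call on the file name.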

3.2 Variable Identification


The R functions used in this report are listed below with their purpose in brief. R function names are case-sensitive, so they are written here as they must be typed.

1. dim(): To check the number of rows and columns
2. head(): To list the top rows of the data set
3. tail(): To list the bottom rows of the data set
4. str(): To find the data types of the variables, i.e. whether they are numerical or categorical
5. summary(): Used to check the following:
a. The basic statistics of the dataset, such as the mean and quartiles, and how
spread out the data is.
b. Whether the numerical variables may contain outliers.
c. Whether there may be null values.
6. anyNA(): To find out whether there are any missing values in the entire dataset
7. boxplot(): To find out whether there are any outliers in the dataset
8. attach(): To avoid repeating the data frame name when accessing its columns
9. select(): To select a few columns from the data frame for display (dplyr)
10. filter(): To filter the rows on a given condition (dplyr)
11. summarise(): To apply summary functions to the columns (dplyr)
12. by(): To group the data by a column and compute the five summary statistics for each group
13. colnames(): To find the column names of the dataset
14. max() and min(): To find the maximum and minimum values of a column
15. table(): To count the observations in each level of a categorical variable
16. plot(): To plot the categorical variables (called on a factor, it draws a bar chart of the level counts)
17. barplot(): To draw a bar chart from a vector or table of values, such as category counts
18. mean(): To find the average of the observations
19. sd(): To find the standard deviation of a variable
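
A few of the functions listed above can be tried on a small data frame. The sketch below uses made-up values (toy is a hypothetical name, not the project data) purely to show what each call returns:

```r
# Small illustrative data frame; values are invented for the example.
toy <- data.frame(
  Season = factor(c("Winter", "Winter", "Summer", "Summer", "Rainy", "Rainy")),
  Temperature = c(2.4, 2.3, 3.1, 3.3, 3.0, 2.9)
)

dim(toy)                                   # number of rows and columns
str(toy)                                   # data type of each column
anyNA(toy)                                 # TRUE if any value is missing
table(toy$Season)                          # observation count per season
by(toy$Temperature, toy$Season, summary)   # summary statistics per season
```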

3.2.1 Variable Identification – Inferences

1. The temperature of cold storage is given for each day through the entire year along with
month and season details.
2. “Temperature” is a Quantitative and continuous variable.
3. From the structure we see that 2 nominal categorical variables are present:
a. “Season” is a factor variable with 3 levels: Summer, Winter and Rainy.
b. “Month” is a factor variable with 12 levels, the month names from January to
December.
4. The minimum temperature maintained at the cold storage is 1.7 and the maximum temperature is
5.0.
5. The number of days in each season is approximately the same.
6. There are no missing values in any of the columns in the dataset.
7. The “Season” variable covers all the seasons in the entire year.
8. The “Month” variable has data from January to December.
9. The “Date” variable holds the day of the respective month.

3.3 Univariate Analysis

1. There is one outlier among the temperatures recorded throughout the year.

2. February has 28 days. Hence the year in which this data was recorded is a non-leap year.

3. The temperature is recorded in three seasons only: Winter, Rainy and Summer.

4. As the total number of observations is 365, the temperature was recorded on every day of
the year.

5. Histogram of the temperature throughout the year

3.4 Bi-Variate Analysis

1. The mean temperature maintained at the cold storage is highest in Summer and lowest in Winter.
2. A few abnormal temperatures were recorded in the Rainy and Winter seasons.
3. The temperature in Summer is maintained at the optimum value.

4. Both the lowest and the highest temperature were recorded in September, during the Rainy season.

5. A temperature below 2 C was recorded twice: once in September (Rainy season) and once in
November (Winter season).

6. Temperatures above 4 C occurred many times during the Rainy season.

7. A temperature below 2 C was recorded once in the Winter season and once in the Rainy season.

8. Month vs Temperature

9. Season vs Temperature
Temperatures were more uneven in Winter.

3.5 Missing Value Identification

No values are missing from the dataset.

3.6 Outlier Identification

One outlier is found from the entire dataset in the temperature column.
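
A boxplot flags as outliers the points lying more than 1.5 times the interquartile range beyond the hinges, and boxplot.stats() returns those points directly. A minimal sketch, assuming a small illustrative vector (the first ten readings shown in Appendix A plus the reported maximum of 5.0) rather than the full dataset:

```r
# Illustrative subset only, not the full 365 observations.
temps <- c(2.4, 2.3, 2.4, 2.8, 2.5, 2.4, 2.8, 2.3, 2.4, 2.8, 5.0)

# $out holds the values beyond 1.5 * IQR from the hinges.
outliers <- boxplot.stats(temps)$out
outliers
```

On the real data the same call would be made on the Temperature column.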

3.7 Variable Transformation / Feature Creation


Not Applicable

4 Conclusion
Temperature control at the cold storage started to deteriorate from September. September, October and January
are the months in which negligence was noticed in the maintenance of temperature, due to
which dairy products started getting contaminated. In the Summer and Rainy seasons no complaints would
have been received, as there was due diligence except on one day in the Rainy season.
The average temperature maintained at the cold storage is low in Winter and high in Summer.
Immediate attention is needed to put corrective measures in place at the cold storage plant to avoid
complaints from customers.

5 Appendix A – Source Code

Cold Storage Project


Setting the work directory and Importing the file :
setwd("C:/Users/ammu/Desktop/Great Lakes/2. Statistical Method Decision Making/Project")
ColdStorage1=read.csv("Cold_Storage_Temp_Data.csv")
attach(ColdStorage1)
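
attach() is convenient, but when a second data frame with the same column names is attached later (as happens with ColdStorage2 in Problem 2), the newer columns mask the older ones. A hedged alternative sketch, using two small made-up data frames (df_year and df_march are hypothetical names), refers to the columns explicitly instead:

```r
# Two invented data frames sharing a column name.
df_year  <- data.frame(Temperature = c(2.4, 2.3, 2.8))
df_march <- data.frame(Temperature = c(3.9, 4.0, 4.1))

# `$` and with() make it unambiguous which data frame is meant.
mean_year  <- mean(df_year$Temperature)
mean_march <- with(df_march, mean(Temperature))
```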

Importing the package:


library(dplyr)

## Warning: package 'dplyr' was built under R version 3.6.2

##
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':


##
## filter, lag

## The following objects are masked from 'package:base':


##
## intersect, setdiff, setequal, union

library(ggplot2)
library(lattice)

## Warning: package 'lattice' was built under R version 3.6.2

Descriptive Statistics:
View(ColdStorage1)
dim(ColdStorage1)

## [1] 365 4

summary(ColdStorage1)

##     Season        Month          Date        Temperature
##  Rainy :122   Aug    : 31   Min.   : 1.00   Min.   :1.700
##  Summer:120   Dec    : 31   1st Qu.: 8.00   1st Qu.:2.500
##  Winter:123   Jan    : 31   Median :16.00   Median :2.900
##               Jul    : 31   Mean   :15.72   Mean   :2.963
##               Mar    : 31   3rd Qu.:23.00   3rd Qu.:3.300
##               May    : 31   Max.   :31.00   Max.   :5.000
##               (Other):179

str(ColdStorage1)

## 'data.frame':    365 obs. of  4 variables:
##  $ Season     : Factor w/ 3 levels "Rainy","Summer",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Month      : Factor w/ 12 levels "Apr","Aug","Dec",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ Date       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Temperature: num  2.4 2.3 2.4 2.8 2.5 2.4 2.8 2.3 2.4 2.8 ...

head(ColdStorage1)

##   Season Month Date Temperature
## 1 Winter   Jan    1         2.4
## 2 Winter   Jan    2         2.3
## 3 Winter   Jan    3         2.4
## 4 Winter   Jan    4         2.8
## 5 Winter   Jan    5         2.5
## 6 Winter   Jan    6         2.4

tail(ColdStorage1)

##     Season Month Date Temperature
## 360 Winter   Dec   26         2.7
## 361 Winter   Dec   27         2.7
## 362 Winter   Dec   28         2.3
## 363 Winter   Dec   29         2.6
## 364 Winter   Dec   30         2.3
## 365 Winter   Dec   31         2.9

colnames(ColdStorage1)

## [1] "Season" "Month" "Date" "Temperature"

anyNA(ColdStorage1)

## [1] FALSE

Exploratory Analysis:
a.Univariate Analysis:
boxplot(Temperature,xlab="Temperature",col="green",horizontal = TRUE,main="Temperature")

There is only one outlier recorded considering the data in the entire year.

hist(Temperature, xlab = "Temperature", col = "red")

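The mean (2.963) lies slightly above the median (2.900) and the single extreme value (5.0) is on the high side, which points to a slightly right-skewed rather than left-skewed distribution.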

plot(Month, main="Months Chart")

plot(Season,col=c("green","blue","red"),main="Season Chart")

Three seasons are recorded in the entire year.


table(Month)

## Month
## Apr Aug Dec Feb Jan Jul Jun Mar May Nov Oct Sep
## 30 31 31 28 31 31 30 31 31 30 31 30

table(Season)

## Season
## Rainy Summer Winter
## 122 120 123

b. Bivariate Analysis
boxplot(Temperature~Season,col=c("green","blue","red"))

The average temperature is relatively low in Winter and high in Summer. A few outliers are noticed in
the Winter and Rainy seasons.

ggplot(ColdStorage1, aes(x = Season, y = Temperature, fill = Month)) + geom_boxplot() + ggtitle("Season vs Temperature")

High Negligence is noticed in maintaining temperature at cold storage during winter season

ggplot(ColdStorage1, aes(x = Month, y = Temperature, fill = Season)) + geom_boxplot() + ggtitle("Months vs Temperature")

In January the cold storage would have started receiving complaints, as that month shows
more outliers.

qplot(Month,Temperature,data=ColdStorage1,col=Season,main="Temperature across months")


qplot(Season,Temperature,data=ColdStorage1,col=Season)

Assignment Questions:
1. Problem 1:

1.1 Find mean cold storage temperature for Summer, Winter and Rainy Season
WinterTemp=ColdStorage1%>% select(Season,Temperature) %>% filter(Season=="Winter")%>% summarise(mean=mean(Temperature))
SummerTemp=ColdStorage1%>% select(Season,Temperature) %>% filter(Season=="Summer")%>% summarise(mean=mean(Temperature))
RainyTemp=ColdStorage1%>% select(Season,Temperature) %>% filter(Season=="Rainy")%>% summarise(mean=mean(Temperature))
WinterTemp

## mean
## 1 2.700813

SummerTemp

## mean
## 1 3.153333

RainyTemp

## mean
## 1 3.039344

Ans: Mean of Temperatures for respective seasons are as below:


Mean of Winter Season Temperature is 2.700813
Mean of Summer Season Temperature is 3.153333
Mean of Rainy Season Temperature is 3.039344
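
The three separate pipelines above can also be expressed as a single grouped computation. A minimal sketch in base R, assuming a small made-up data frame (toy_temps is a hypothetical name) in place of ColdStorage1:

```r
# Invented values, two readings per season.
toy_temps <- data.frame(
  Season = c("Winter", "Winter", "Summer", "Summer", "Rainy", "Rainy"),
  Temperature = c(2.4, 2.6, 3.1, 3.3, 3.0, 2.8)
)

# One mean per season; rows come back ordered by season name.
season_means <- aggregate(Temperature ~ Season, data = toy_temps, FUN = mean)
season_means
```

The same call on ColdStorage1 would return all three seasonal means in one table.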

1.2 Find overall mean for the full year

MeanforYear=mean(Temperature)
MeanforYear

## [1] 2.96274

Ans: Mean of Overall Temperatures throughout the year is 2.96274
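
As a consistency check, the overall mean should equal the seasonal means weighted by the number of days in each season. The sketch below uses only figures reported in this document (the seasonal means from 1.1 and the day counts from table(Season)):

```r
# Seasonal means and day counts as reported above.
season_mean <- c(Winter = 2.700813, Summer = 3.153333, Rainy = 3.039344)
season_days <- c(Winter = 123, Summer = 120, Rainy = 122)

# Weighted average of the seasonal means; should be close to 2.96274.
overall_mean <- weighted.mean(season_mean, season_days)
overall_mean
```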

1.3 Find Standard Deviation for the full year

SdforYear=sd(Temperature)
SdforYear

## [1] 0.508589

Ans: The standard deviation of the temperature throughout the year is 0.508589

1.4 Assume Normal distribution, what is the probability of temperature having fallen
below 2 deg C?

lessThan2C=2
normMean=mean(ColdStorage1$Temperature)
normSD=sd(ColdStorage1$Temperature)

below2Cprob=pnorm(lessThan2C,normMean,normSD)
below2Cprob

## [1] 0.02918146

Ans: The probability of temperature having fallen below 2 deg C is 0.02918146

1.5 Assume Normal distribution, what is the probability of temperature having gone
above 4 C?

moreThan4C=4
above4Cprob=1-pnorm(moreThan4C,normMean,normSD)
above4Cprob

## [1] 0.02070077

Ans: The probability of temperature having gone above 4 C is 0.02070077

1.6 What will be the penalty for the AMC Company?

totalProb=below2Cprob+above4Cprob
Penalty=totalProb*100
Penalty

## [1] 4.988223

Ans: It was given that if the “probability of temperature going outside the 2 - 4 C during the one-year
contract was above 2.5% and less than 5% then the penalty would be 10% of AMC (annual
maintenance case).”
Since the calculated probability of temperature going outside 2 - 4 C is 0.04988223, i.e. approximately
4.99%, which lies between 2.5% and 5%, the penalty would be 10% of the AMC.
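
The whole penalty calculation can be reproduced from the reported mean and standard deviation alone. A minimal sketch, assuming the rounded values printed above; only the penalty tier quoted in this report is encoded, and any other tier is left as NA because it is not stated here:

```r
# Reported full-year mean and standard deviation.
mean_temp <- 2.96274
sd_temp   <- 0.508589

p_below <- pnorm(2, mean = mean_temp, sd = sd_temp)                      # P(T < 2)
p_above <- pnorm(4, mean = mean_temp, sd = sd_temp, lower.tail = FALSE)  # P(T > 4)
p_outside <- p_below + p_above

# Tier quoted in the report: above 2.5% and below 5% gives a 10% penalty.
penalty_pct <- if (p_outside > 0.025 && p_outside < 0.05) 10 else NA
```

Using lower.tail = FALSE gives the upper-tail probability directly, which is equivalent to the 1 - pnorm(...) form used above.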

Problem 2:
1. Which hypothesis test shall be performed to check if corrective action is needed at the
cold storage plant? Justify your answer.
Ans: We have to perform a single-sample, right-tailed t-test to check whether corrective action is
needed at the cold storage plant.
Single sample: we have only one sample of one variable, temperature, whose mean is compared with a fixed value.
Right-tailed: the alternative hypothesis uses a "greater than" sign.
t-test: we have the sample mean and n > 30, but the population standard deviation is not known.
Hence we cannot perform a Z-test, only a t-test.

2. State the hypothesis, perform the hypothesis test and determine the p-value.

Ans:
Null Hypothesis: The null hypothesis is the commonly occurring scenario, or what is believed to
be true. Here the commonly believed fact is that the cold storage is maintaining the optimum temperature,
i.e. <=3.9
Hence the Null Hypothesis is: meanTemp<=3.9

Alternate Hypothesis: This is what we are trying to prove, or the deviation from normal.
Here the customers are complaining that dairy products are going sour and often smelling. This will
happen only if the temperature is maintained above 3.9
Hence the Alternate Hypothesis is: meanTemp>3.9

p-value Calculation:
setwd("C:/Users/ammu/Desktop/Great Lakes/2. Statistical Method Decision Making/Project")
ColdStorage2=read.csv("Cold_Storage_Mar2018.csv")
attach(ColdStorage2)

## The following objects are masked from ColdStorage1:


##
## Date, Month, Season, Temperature

MeanTemp2=mean(Temperature)
MeanTemp2

## [1] 3.974286

SdTemp2=sd(Temperature)
SdTemp2

## [1] 0.159674

n=35
alpha=0.1
confLevel=1-0.1

t.test(Temperature,mu=3.9,alternative = "greater", paired = FALSE,conf.level = confLevel )

##
## One Sample t-test
##
## data: Temperature
## t = 2.7524, df = 34, p-value = 0.004711
## alternative hypothesis: true mean is greater than 3.9
## 90 percent confidence interval:
## 3.939011 Inf
## sample estimates:
## mean of x
## 3.974286

Ans: The calculated p-value is 0.004711 and the given alpha is 0.1.
Since the p-value is lower than alpha, the null hypothesis is rejected.
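
The t.test output can be cross-checked by hand from the sample mean, standard deviation and size reported above. A minimal sketch, assuming those rounded values; the statistic is t = (sample mean - 3.9) / (sd / sqrt(n)) with n - 1 degrees of freedom:

```r
# Reported sample summary for the March 2018 data.
x_bar <- 3.974286
s     <- 0.159674
n     <- 35
mu0   <- 3.9
alpha <- 0.1

se      <- s / sqrt(n)                           # standard error of the mean
t_stat  <- (x_bar - mu0) / se                    # one-sample t statistic
p_value <- pt(t_stat, df = n - 1, lower.tail = FALSE)    # right-tailed p-value
lower_bound <- x_bar - qt(1 - alpha, df = n - 1) * se    # one-sided 90% bound

reject_null <- p_value < alpha
```

These values should agree with the t, p-value and confidence bound printed by t.test, up to rounding of the inputs.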

3. Give your inference.


The problem is at the cold storage plant and not on the procurement side. The hypothesis
test gives evidence, at the 90% confidence level, that the mean temperature is not being maintained at 3.9 C
or below. Since the sample size is greater than 30, this is good enough to
predict that a similar outcome would be seen even if the population of one year of data were considered.
