SMDM Project - Business Report - R
BUSINESS REPORT
SMDM Project 1
Executive Summary:
The main objective of this ‘Wholesale Customer Analysis’ is to perform a detailed
exploratory data analysis of the dataset using measures of central tendency (mean, median and
mode), measures of spread and other parameters. The findings are intended to help solve the
wholesale distributor's business problem and, in turn, increase profit.
Dataset Summary:
Problem Statement:
1.1 Use methods of descriptive statistics to summarize data. Which Region and
which Channel spent the most? Which Region and which Channel spent the least?
Solution:
Descriptive statistics describe and summarize the basic features of a dataset in a given
study, presented as a summary of the data sample and its measurements. The most widely
recognized descriptive statistics are the measures of central tendency (mean, median and
mode), which are used to solve business problems.
From the descriptive statistics, we can see that the dataset contains two unique Channels
(Hotel and Retail) and three unique Regions. Of the two Channels, Hotel is the most
frequent.
Which Region and Which Channel has spent the most and Least?
Figure 4.1
Conclusion: From the above data, Region “Other” has spent the most and Region “Oporto” the
least. Among Channels, Hotel has spent the most and Retail the least.
As per Figure 4.1, grouping Total Spend by Region and by Channel leads to the following
conclusions:
❖ Region “Other” has the highest spend and the amount is $10.7 M
❖ Region “Oporto” has the least spend and the amount is $1.5 M
❖ Hotel Channel has the highest spend and the amount is $7.9 M
❖ Retail Channel has the least spend and the amount is $6.6 M
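The grouping behind these totals can be sketched as follows. This is a minimal illustration using only the Python standard library. The rows and the helper name total_by are hypothetical and are not taken from the report's dataset, so the printed totals will not match the figures above.

```python
from collections import defaultdict

# Hypothetical sample rows: (channel, region, total annual spend).
# Illustrative values only; they are NOT the report's data.
rows = [
    ("Hotel", "Other", 120),
    ("Retail", "Other", 90),
    ("Hotel", "Lisbon", 60),
    ("Retail", "Lisbon", 40),
    ("Hotel", "Oporto", 30),
    ("Retail", "Oporto", 25),
]


def total_by(rows, key_index):
    """Sum the spend column for each distinct value of the chosen key column."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key_index]] += row[2]
    return dict(totals)


region_totals = total_by(rows, 1)
channel_totals = total_by(rows, 0)

# Highest and lowest spenders in each grouping.
top_region = max(region_totals, key=region_totals.get)
low_region = min(region_totals, key=region_totals.get)
top_channel = max(channel_totals, key=channel_totals.get)
low_channel = min(channel_totals, key=channel_totals.get)

print("Region totals:", region_totals)
print("Channel totals:", channel_totals)
print("Most/least by region:", top_region, low_region)
print("Most/least by channel:", top_channel, low_channel)
```

The same pattern applies to the real dataset once each customer's six item columns are summed into a single total spend value.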
1.2 There are 6 different varieties of items that are considered. Describe and
comment/explain all the varieties across Region and Channel? Provide a detailed
justification for your answer.
Solution:
We get the following charts and values for all the 6 different items across Region and
Channel
Fig 5.1
From the above chart and table it is evident that the “Hotel” channel has the highest spend
across all the regions versus the “Retail” channel for fresh items.
Fig : 5.2
From the above graph and table it’s evident that the “Retail” channel has the highest spend in
the Oporto and Other regions for the Milk item. However, for the Lisbon region the “Hotel”
channel still has the highest spend for the Milk item.
Fig:6.1
From the above graph, it’s evident that the “Hotel” channel has the highest spend in all the
regions for Frozen Item versus the “Retail” channel.
Fig: 6.2
From the above graph (6.2) and table it’s evident that the “Hotel” channel has the highest
spend in all the regions for Delicatessen Item versus the “Retail” channel.
1.3 On the basis of a descriptive measure of variability, which item shows the
most inconsistent behaviour? Which items show the least inconsistent
behaviour?
Solution:
Inference:
As per the above figure, we used Python to calculate the coefficient of variation for all
six items. Item “Fresh” has the lowest coefficient of variation (1.05), which means
“Fresh” shows the least inconsistent behavior. Similarly, item “Delicatessen” has the
highest coefficient of variation (1.85), which means “Delicatessen” shows the most
inconsistent behavior.
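The coefficient of variation is the standard deviation divided by the mean. A minimal sketch of the calculation is given below. The spend values are hypothetical placeholders, and the choice of the sample standard deviation is an assumption, since the report does not say which form was used.

```python
import statistics


def coefficient_of_variation(values):
    """Return the sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)


# Hypothetical spend values per item; NOT the report's data.
spend = {
    "Fresh": [10, 12, 11, 13, 9],
    "Delicatessen": [1, 20, 2, 30, 3],
}

cv = {item: coefficient_of_variation(values) for item, values in spend.items()}

# The item with the largest CV is the most inconsistent, the smallest the least.
most_inconsistent = max(cv, key=cv.get)
least_inconsistent = min(cv, key=cv.get)

for item, value in cv.items():
    print(f"{item}: CV = {value:.2f}")
print("Most inconsistent:", most_inconsistent)
print("Least inconsistent:", least_inconsistent)
```

Because the coefficient of variation is unit-free, it allows items with very different average spend to be compared on the same scale.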
1.4 Are there any outliers in the data? Backup your answer with a suitable
plot/technique with the help of detailed comments.
Figure 8.1 – Output from Python grouped by Regions & Channels with Item- Fresh
The box plots above show that all six items in the given sample dataset have
outliers. We can also say that Fresh items have the highest spend across Regions and
Channels, while Delicatessen items have the lowest spend across Regions and
Channels.
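A box plot conventionally flags as outliers the points lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies that rule directly. The sample values and the helper name iqr_outliers are hypothetical, and the 1.5 multiplier is the usual plotting default rather than something stated in the report.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # method="inclusive" interpolates linearly between data points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical spend values for one item; NOT the report's data.
sample = [10, 12, 13, 14, 15, 16, 18, 95]

outliers = iqr_outliers(sample)
print("Outliers:", outliers)
```

Running the same check on each of the six item columns gives a numeric count of outliers to accompany the plot.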
1.5 On the basis of your analysis, what are your recommendations for the
business? How can your analysis help the business to solve its problem?
Answer from the business perspective
As per the analysis, spending differs between the Hotel and Retail channels, so the
business needs to focus on bringing the two level. Spending should likewise be brought
level across the different regions. The Coefficient of Variation shown in Fig: 8.2 below
reveals inconsistencies in spending on the different items, which should be
minimized so that each item shows consistent behavior.
Fig: 8.2
Problem 2
Introduction:
The Student News Service at Clear Mountain State University (CMSU) has decided to gather
data about the undergraduate students who attend CMSU. CMSU creates and distributes a
survey of 14 questions and receives responses from 62 undergraduates (stored in a CSV
file).
Executive Summary:
The main objective of this CMSU Survey analysis is to provide an explanatory and
relative approach for each of the below listed problems.
2.1. For this data, construct the following contingency tables (Keep Gender as
row variable)
The figure above shows the contingency table for Gender vs Grad Intention
The figure above shows the contingency table for Gender vs Employment
2.2.1. What is the probability that a randomly selected CMSU student will be
male?
Sol:
From the given dataset, we identified that 29 of the 62 students are male,
so the probability that a randomly selected CMSU student will be “Male” is 46.77%
2.2.2. What is the probability that a randomly selected CMSU student will be
female?
Sol:
From the given dataset, we identified that 33 of the 62 students are female,
so the probability that a randomly selected CMSU student will be
“Female” is 53.23%
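The two marginal probabilities follow directly from the gender counts reported above (29 male and 33 female out of 62). A short check of the arithmetic, with variable names chosen here for illustration:

```python
# Gender counts as reported in the text above.
counts = {"Male": 29, "Female": 33}
total = sum(counts.values())

# Marginal probability of each gender = count / total respondents.
probs = {gender: count / total for gender, count in counts.items()}

for gender, p in probs.items():
    print(f"P({gender}) = {p:.2%}")
```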
2.3.1. Find the conditional probability of different majors among the male
students in CMSU.
Based on the above output, it’s clear that CMSU “Male” students most prefer
“Management” as a major and least prefer the “CIS” major.
2.3.2 Find the conditional probability of different majors among the female
students of CMSU.
Python code with calculation to prove
the problem
2.4. Assume that the sample is representative of the population of CMSU. Based on the data,
answer the following question:
2.4.1. Find the probability That a randomly chosen student is a male and intends to graduate.
2.4.2 Find the probability that a randomly selected student is a female and does NOT have a
laptop.
Probability that a randomly selected student is a female and does NOT have a laptop
= Probability that a randomly chosen student is a female * Probability that a student has no
laptop, given that the student is female
The probability that a randomly selected student is a female and does NOT have a
laptop is 6.45%
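The multiplication rule used here is P(Female and No Laptop) = P(Female) * P(No Laptop | Female). The sketch below works through it. The count of 4 females without a laptop is an assumption inferred from the reported 6.45% (4 / 62), since the contingency table itself is not reproduced in the text.

```python
total_students = 62
females = 33
# ASSUMPTION: inferred from the reported 6.45% (4 / 62); not stated in the text.
females_without_laptop = 4

p_female = females / total_students
p_no_laptop_given_female = females_without_laptop / females

# Multiplication rule: P(A and B) = P(A) * P(B | A).
p_joint = p_female * p_no_laptop_given_female

print(f"P(Female and No Laptop) = {p_joint:.2%}")
```

The product simplifies to the joint count divided by the total, so the same value can be read straight off a contingency table.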
2.5. Assume that the sample is representative of the population of CMSU. Based on the data,
answer the following question:
2.5.1. Find the probability that a randomly chosen student is a male or has full-time employment?
Probability that a randomly chosen student is either a male or has full-time employment
= Probability of a student being male + Probability of a student having full-time employment
- Probability of a student being male and having full-time employment
The probability that a randomly chosen student is either a male or has full-time employment
is 79.87%
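The addition rule for two events that can overlap subtracts the joint probability so that students who are both male and employed full-time are not counted twice: P(A or B) = P(A) + P(B) - P(A and B). A minimal sketch follows. The counts are hypothetical and are not the survey's values, so the result does not reproduce the figure reported above.

```python
def prob_union(n_a, n_b, n_both, n_total):
    """P(A or B) by inclusion-exclusion, computed from counts."""
    return (n_a + n_b - n_both) / n_total


# Hypothetical counts, for illustration only (NOT the survey data).
n_total = 100
n_a = 30      # members of group A
n_b = 12      # members of group B
n_both = 5    # members of both A and B

p_union = prob_union(n_a, n_b, n_both, n_total)
p_naive = (n_a + n_b) / n_total  # double-counts the overlap

print(f"P(A or B) = {p_union:.2%}")
print(f"Naive sum without the correction = {p_naive:.2%}")
```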
2.5.2. Find the conditional probability that given a female student is randomly
chosen, she is majoring in international business or management.
The conditional probability that given a female student is randomly chosen, she is majoring
in international business or management is 24.242 %
Whether being female and graduate intention are independent can be checked against the
condition:
P(F ∩ Yes) = P(F) * P(Yes)
where F = Female
Yes = Grad Intention being Yes
The condition does not hold for this data. Hence, graduate intention and being female are not independent events.
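The independence condition compares the joint probability with the product of the two marginals. The sketch below shows the comparison. The counts and the helper name independence_gap are hypothetical, chosen only to show one case where the condition holds and one where it does not.

```python
def independence_gap(n_a, n_b, n_both, n_total):
    """Return P(A and B) - P(A) * P(B); zero means the events are independent."""
    p_a = n_a / n_total
    p_b = n_b / n_total
    p_both = n_both / n_total
    return p_both - p_a * p_b


# Hypothetical counts (NOT the survey data).
gap_independent = independence_gap(n_a=40, n_b=50, n_both=20, n_total=100)
gap_dependent = independence_gap(n_a=40, n_b=50, n_both=30, n_total=100)

print(f"Gap when independent: {gap_independent:.4f}")
print(f"Gap when dependent:   {gap_dependent:.4f}")
```

In sample data the gap is rarely exactly zero, so a small non-zero gap is weaker evidence of dependence than a large one.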
2.7. Note that there are four numerical (continuous) variables in the data set, GPA,
Salary, Spending, and Text Messages.
2.7.1. If a student is chosen randomly, what is the probability that his/her GPA is
less than 3?
Although GPA is a continuous variable, the probability that a student's GPA is less than 3
is approximated here using the Poisson distribution, which is a discrete distribution.
To calculate the probability of a GPA of 3 or less, we add the probabilities of GPA values
0, 1, 2 and 3 obtained from the Poisson distribution.
If a student is chosen randomly, the probability that his/her GPA is less than 3
is 39.49%
2.7.2. Find the conditional probability that a randomly selected male earns 50 or
more. Find the conditional probability that a randomly selected female earns 50 or
more.
The above distplot represents the salary of all the males in the sample.
As the plot shows, it is approximately normally distributed, so the conditional probability
that a randomly selected male earns 50 or more can be calculated using the Normal distribution.
To calculate this, we compute the cumulative probability of earning less than 50 using the
Normal distribution and then subtract it from 1.
From the calculations done in Python, we conclude that:
The conditional probability that a randomly selected male earns 50 or more is 83.04%
Find the conditional probability that a randomly selected female earns 50 or more.
The above distplot represents the salary of all the females in the sample.
As the plot shows, it is approximately normally distributed, so the conditional probability that a
randomly selected female earns 50 or more can be calculated using the Normal distribution.
To calculate this, we compute the cumulative probability of earning less than 50 using the Normal
distribution and then subtract it from 1.
The conditional probability that a randomly selected female earns 50 or more is 86.09%
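The "cumulative probability below 50, subtracted from 1" step can be sketched with the standard library's NormalDist. The salary values are hypothetical and are not the survey's data, so the printed probability will differ from the figures reported above.

```python
from statistics import NormalDist, mean, stdev

# Hypothetical salaries for one gender group (NOT the survey data).
salaries = [40, 45, 50, 55, 60, 65, 70]

# Fit a normal distribution using the sample mean and sample standard deviation.
dist = NormalDist(mu=mean(salaries), sigma=stdev(salaries))

# P(X >= 50) = 1 - P(X < 50); for a continuous distribution P(X < 50) = cdf(50).
p_at_least_50 = 1 - dist.cdf(50)

print(f"P(salary >= 50) = {p_at_least_50:.2%}")
```

Fitting the distribution separately to the male and female subsets gives the two conditional probabilities.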
2.8. Note that there are four numerical (continuous) variables in the data set, GPA,
Salary, Spending, and Text Messages. For each of them, comment whether they
follow a normal distribution. Write a note summarizing your conclusions.
Figure: Histograms of the continuous variables (panel titles: Salary, GPA)
From the above histograms for the continuous variables GPA, Salary, Spending and Text
Messages we can see that :
● GPA is almost normally distributed, with a slight skew to the left.
● Salary is also approximately normally distributed, with a slight skew to the right.
● Spending is not normally distributed and is highly right-skewed.
● Text Messages is not normally distributed and is highly right-skewed.
Problem 3
Introduction
An important quality characteristic used by the manufacturers of ABC asphalt shingles is the
amount of moisture the shingles contain when they are packaged. Customers may feel that
they have purchased a product lacking in quality if they find moisture and wet shingles
inside the packaging. In some cases, excessive moisture can cause the granules attached
to the shingles for texture and coloring purposes to fall off the shingles resulting in
appearance problems.
Executive Summary:
To monitor the amount of moisture present, the company conducts moisture tests. A
shingle is weighed and then dried. The shingle is then reweighed, and based on the amount
of moisture taken out of the product, the pounds of moisture per 100 square feet are
calculated. The company would like to show that the mean moisture content is less than
0.35 pounds per 100 square feet.
The file (A & B shingles.csv) includes 36 measurements (in pounds per 100 square feet) for
A shingles and 31 for B shingles.
3.1 Do you think there is evidence that the mean moisture contents in both types of
shingles are within the permissible limits? State your conclusions clearly,
showing all steps.
Solution:
As per the problem statement there are two types of shingles i.e. A Shingles & B Shingles so
let us take each shingle separately for this analysis.
For A Shingles
Assumption:
Null Hypothesis (H0) = Mean moisture content in A Shingles is not less than the permissible limit
Alternate Hypothesis (H1) = Mean moisture content in A Shingles is less than the permissible limit
A one-sample t-test is used to determine whether an unknown population mean differs from a
specific value. Hence, using Python, we calculate the t-value and p-value as shown below
Input:
Output:
Conclusion:
Since the p-value is 0.0748, which is greater than 0.05, we cannot reject the null hypothesis.
Hence, we conclude that there is not enough evidence that the mean moisture content of A
shingles is less than the permissible limit of 0.35 pounds per 100 square feet. If the
population mean moisture content is in fact no less than 0.35 pounds per 100 square feet,
the probability that a sample of 36 A shingles yields a sample mean moisture
content of 0.3167 pounds per 100 square feet or less is 0.0748.
For B Shingles
Assumption:
Null Hypothesis (H0) = Mean moisture content in B Shingles is not less than the permissible limit
Alternate Hypothesis (H1) = Mean moisture content in B Shingles is less than the permissible limit
A one-sample t-test is used to determine whether an unknown population mean differs from a
specific value. Hence, using Python, we calculate the t-value and p-value as shown below
Input:
Output
Conclusion:
Since the p-value is 0.0021, which is less than 0.05, we can reject the null hypothesis.
Hence, we conclude that there is enough evidence that the mean moisture content of B shingles
is less than the permissible limit of 0.35 pounds per 100 square feet. If the population
mean moisture content is in fact no less than 0.35 pounds per 100 square feet, the probability
that a sample of 31 B shingles yields a sample mean moisture content of
0.2735 pounds per 100 square feet or less is 0.0021.
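The report's own Python input is not reproduced in the text, so the following is only a sketch of a lower-tailed one-sample t-test using the standard library. The moisture readings are hypothetical, and the helper names t_pdf, t_cdf and one_sample_t_lower are introduced here for illustration. The p-value is obtained by numerically integrating the Student's t density; in practice a statistics library routine such as scipy.stats.ttest_1samp would normally be used.

```python
import math
import statistics


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(t, df, steps=2000):
    """P(T <= t) via Simpson's rule on [0, |t|], using symmetry about zero."""
    a = abs(t)
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if t >= 0 else 0.5 - area


def one_sample_t_lower(sample, mu0):
    """Test H0: mean >= mu0 against H1: mean < mu0; return (t statistic, p-value)."""
    n = len(sample)
    std_error = statistics.stdev(sample) / math.sqrt(n)
    t_stat = (statistics.mean(sample) - mu0) / std_error
    return t_stat, t_cdf(t_stat, n - 1)


# Hypothetical moisture readings in pounds per 100 square feet (NOT the report's data).
readings = [0.30, 0.28, 0.33, 0.31, 0.29, 0.35, 0.27, 0.32]

t_stat, p_value = one_sample_t_lower(readings, mu0=0.35)
print(f"t = {t_stat:.3f}, one-sided p-value = {p_value:.4f}")
print("Reject H0" if p_value < 0.05 else "Fail to reject H0")
```

The decision rule matches the one applied above: reject the null hypothesis only when the one-sided p-value falls below 0.05.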
3.2 Do you think that the population mean for shingles A and B are
equal? Form the hypothesis and conduct the test of the hypothesis.
What assumption do you need to check before the test for equality
of means is performed?
To test the equality of the population means of the A shingles and B shingles, the
null and alternative hypotheses are:
Null Hypothesis (H0): μA = μB
Alternate Hypothesis (H1): μA ≠ μB
We have two samples, A and B, and we do not know the population standard deviations.
The samples are not large, so we use the t distribution and the tSTAT test statistic.
Since we are testing for equality of means between samples A and B, we use a two-sample t-test.
Therefore, we fail to reject the null hypothesis: there is not enough evidence to conclude that the population means for shingles A and B differ.
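A minimal sketch of the pooled two-sample t statistic is given below. It assumes the two samples are independent, drawn from approximately normal populations and share a common variance, which are the standard assumptions to check before this form of the test. The readings and the helper name pooled_t_statistic are hypothetical, and the p-value step is left to a statistics library.

```python
import math
import statistics


def pooled_t_statistic(sample_a, sample_b):
    """Return (t statistic, degrees of freedom) for an equal-variance two-sample t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    var_a = statistics.variance(sample_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(sample_b)
    df = n_a + n_b - 2
    # Pooled variance: weighted average of the two sample variances.
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / df
    std_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t_stat = (statistics.mean(sample_a) - statistics.mean(sample_b)) / std_error
    return t_stat, df


# Hypothetical moisture readings (NOT the report's data).
shingles_a = [0.1, 0.2, 0.3, 0.4, 0.5]
shingles_b = [0.2, 0.3, 0.4, 0.5, 0.6]

t_stat, df = pooled_t_statistic(shingles_a, shingles_b)
print(f"t = {t_stat:.3f} with {df} degrees of freedom")
```

If the equal-variance assumption looks doubtful, Welch's unequal-variance version of the test is the usual alternative.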