
SMDM Project - Business Report - R

This document provides a statistical analysis of survey data from Clear Mountain State University (CMSU). It includes contingency tables analyzing relationships between gender and major, graduation intentions, employment status, and computer usage. Probabilities are calculated for a randomly selected student's gender. Conditional probabilities of majors among male students are also found. The analysis aims to help CMSU understand its student population based on this survey data.

Uploaded by

hepzi selvam

Statistical Methods for Decision Making (SMDM)

Project

BUSINESS REPORT

M.P Surender Nath

PGP-DSBA Online – October’21 batch

Date: 12th December 21


Wholesale Customers Analysis


Introduction:

A wholesale distributor operating in different regions of Portugal has information on the
annual spending on several items in its stores across different regions and channels. We
have taken a sample dataset consisting of 440 large retailers’ annual spending on 6
different varieties of products in 3 different regions (Lisbon, Oporto, Other) and
across 2 sales channels (Hotel, Retail).

Executive Summary:

The main objective of this ‘Wholesale Customer Analysis’ is to perform a detailed
exploratory data analysis of the dataset using measures of central tendency (Mean, Median &
Mode), spread analysis and other parameters. The findings are intended to help solve the
wholesale distributor’s business problem, which in turn should increase profit.

Dataset Summary:

The sample dataset consists of 440 rows (samples) and 9 columns (Buyer/Spender,
Channel, Region, Fresh, Milk, Grocery, Frozen, Detergents_Paper, Delicatessen), as shown in
the table below.

Out of the 9 columns, 7 are of Integer type and the remaining 2 are of Object type. There are
no Null/Missing values in the sample dataset.

Problem Statement:

1.1 Use methods of descriptive statistics to summarize data. Which Region and
which Channel spent the most? Which Region and which Channel spent the least?

Solution:

Descriptive statistics describe and summarize the basic features of a dataset, giving a
compact summary of the data sample and its measurements. The most widely used descriptive
statistics are the measures of central tendency: Mean, Median and Mode.

From the descriptive statistics, we can see that there are two unique Channels
(Hotel and Retail) and three unique Regions in the dataset. Of the two
Channels, Hotel is the most frequent.

Which Region and which Channel spent the most and the least?

Figure 4.1

Conclusion: From the above data, the “Other” region spent the most and Oporto spent the
least. By channel, Hotel spent the most and Retail spent the least.

As per Figure 4.1, grouping Total Spend by Region shows that

❖ Region “Other” has the highest spend, at $10.7 M
❖ Region “Oporto” has the lowest spend, at $1.5 M

Similarly, grouping Total Spend by Channel shows that

❖ Hotel Channel has the highest spend, at $7.9 M
❖ Retail Channel has the lowest spend, at $6.6 M
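
The grouping step described above can be sketched in plain Python. This is a minimal illustration, not the report's own code: the function name `total_spend_by` and the four sample rows are hypothetical, and only the column names come from the dataset summary.

```python
from collections import defaultdict

# Item columns named in the dataset summary.
ITEMS = ["Fresh", "Milk", "Grocery", "Frozen", "Detergents_Paper", "Delicatessen"]


def total_spend_by(rows, key):
    """Sum spending over all six item columns, grouped by `key` ("Region" or "Channel")."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += sum(row[item] for item in ITEMS)
    return dict(totals)


# Hypothetical rows for illustration only; these are NOT values from the report's dataset.
rows = [
    {"Channel": "Hotel", "Region": "Other", "Fresh": 100, "Milk": 20, "Grocery": 30,
     "Frozen": 10, "Detergents_Paper": 5, "Delicatessen": 5},
    {"Channel": "Retail", "Region": "Oporto", "Fresh": 10, "Milk": 40, "Grocery": 50,
     "Frozen": 5, "Detergents_Paper": 20, "Delicatessen": 5},
    {"Channel": "Hotel", "Region": "Lisbon", "Fresh": 80, "Milk": 10, "Grocery": 20,
     "Frozen": 30, "Detergents_Paper": 5, "Delicatessen": 5},
    {"Channel": "Retail", "Region": "Other", "Fresh": 20, "Milk": 60, "Grocery": 70,
     "Frozen": 5, "Detergents_Paper": 30, "Delicatessen": 15},
]

by_region = total_spend_by(rows, "Region")
by_channel = total_spend_by(rows, "Channel")

# The group with the largest / smallest total answers "most" and "least".
top_region = max(by_region, key=by_region.get)
bottom_region = min(by_region, key=by_region.get)
print(by_region, top_region, bottom_region)
print(by_channel)
```

With the real dataset, the same grouping would be applied to all 440 rows.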

1.2 There are 6 different varieties of items that are considered. Describe and
comment/explain all the varieties across Region and Channel? Provide a detailed
justification for your answer.

Solution:

We get the following charts and values for the different items across Region and
Channel.

Fig 5.1

From the above chart and table it is evident that, for the Fresh item, the “Hotel” channel
has a higher spend than the “Retail” channel in all regions.

Fig 5.2

From the above graph and table it is evident that, for the Milk item, the “Retail” channel
has the higher spend in the Oporto and Other regions. In the Lisbon region, however, the
“Hotel” channel has the higher spend.

Fig 6.1

From the above graph it is evident that, for the Frozen item, the “Hotel” channel has a
higher spend than the “Retail” channel in all regions.

Fig 6.2

From the above graph (Fig 6.2) and table it is evident that, for the Delicatessen item, the
“Hotel” channel has a higher spend than the “Retail” channel in all regions.

1.3 On the basis of a descriptive measure of variability, which item shows the
most inconsistent behaviour? Which items show the least inconsistent
behaviour?

Solution:

The inconsistency of an item can be measured with the coefficient of variation (CV),
calculated for each of the items in the sample dataset. The CV is the ratio of the
standard deviation to the mean. The higher the coefficient of variation, the greater
the dispersion around the mean and the lower the consistency of the data, and vice versa.

Inference:

As per the above figure, we used Python to calculate the coefficient of variation for
all six items. “Fresh” has the lowest coefficient of variation (1.05), which means
it shows the least inconsistent behavior. “Delicatessen” has the highest coefficient
of variation (1.85), which means it shows the most inconsistent behavior.
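
As a small illustration of the measure, the coefficient of variation can be computed with the standard library. The two spend series below are hypothetical, and the use of the sample standard deviation is an assumption; the report does not say which form of the standard deviation was used.

```python
import statistics


def coefficient_of_variation(values):
    """Return the sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)


# Hypothetical spend series, NOT taken from the report's dataset.
steady_item = [10, 11, 9, 10, 10]   # values close to the mean
erratic_item = [1, 2, 1, 40, 6]     # same mean, far more spread

cv_steady = coefficient_of_variation(steady_item)
cv_erratic = coefficient_of_variation(erratic_item)

# The item with the larger CV is the more inconsistent one.
print(f"steady: {cv_steady:.3f}, erratic: {cv_erratic:.3f}")
```

Both series have a mean of 10, so the difference in CV comes entirely from the spread.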

1.4 Are there any outliers in the data? Backup your answer with a suitable
plot/technique with the help of detailed comments.

Figure 8.1 – Output from Python grouped by Regions & Channels with Item- Fresh

The above box plots show that all six items in the sample dataset have
outliers. We can also say that Fresh items have the highest spend
across regions and channels, while Delicatessen items have the lowest spend across
regions and channels.
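
Box plots usually flag a point as an outlier when it lies more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies that rule with the standard library. The helper name `iqr_outliers`, the sample values and the "inclusive" quartile method are assumptions made for illustration; the report does not state how its plots were configured.

```python
import statistics


def iqr_outliers(values):
    """Return the values lying outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical spend values, NOT taken from the report's dataset.
spend = [10, 12, 11, 13, 12, 11, 95]
print(iqr_outliers(spend))
```

Running the same check on each of the six item columns would give a count of outliers per item to accompany the plots.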

1.5 On the basis of your analysis, what are your recommendations for the
business? How can your analysis help the business to solve its problem?
Answer from the business perspective

The analysis shows that spending differs between the Hotel and Retail channels, so
the lower-spending channel needs attention to close the gap. Spending also differs
between regions and should be brought closer together. The Coefficient of Variation
shown in Fig 8.2 below reveals inconsistencies in spending across items, which should be
reduced so that each item shows more consistent behavior.

Fig: 8.2

Problem 2

Clear Mountain State University Survey Analysis

Introduction:

The Student News Service at Clear Mountain State University (CMSU) has decided to gather
data about the undergraduate students that attend CMSU. CMSU creates and distributes a
survey of 14 questions and receives responses from 62 undergraduates (stored in a CSV
file).

Executive Summary:

The main objective of this CMSU Survey analysis is to provide an explanatory and
relative approach for each of the below listed problems.

2.1. For this data, construct the following contingency tables (Keep Gender as
row variable)

The figure above shows the contingency table for Gender vs Major.

The figure above shows the contingency table for Gender vs Grad Intention.

The figure above shows the contingency table for Gender vs Employment.

The figure above shows the contingency table for Gender vs Computer.

2.2.1. What is the probability that a randomly selected CMSU student will be
male?

Sol:

From the given dataset, 29 of the 62 students are male, so the probability that a
randomly selected CMSU student will be male is 29/62 = 46.77%.

2.2.2. What is the probability that a randomly selected CMSU student will be
female?

Sol:

From the given dataset, 33 of the 62 students are female, so the probability that a
randomly selected CMSU student will be female is 33/62 = 53.23%.

2.3. Assume that the sample is representative of the population of CMSU.
Based on the data, answer the following question:

2.3.1. Find the conditional probability of different majors among the male
students in CMSU.

Based on the above output, it is clear that male CMSU students most often choose
“Management” as their major and least often choose “CIS”.

Probability of Accounting among the male students = 4/29
Probability of CIS among the male students = 1/29
Probability of Economics/Finance among the male students = 4/29
Probability of International Business among the male students = 2/29
Probability of Management among the male students = 6/29
Probability of Other among the male students = 4/29
Probability of Retailing/Marketing among the male students = 5/29
Probability of Undecided among the male students = 3/29

Hence from the calculations done in Python we conclude that:

The probability of Accounting among the male students is: 13.79%
The probability of CIS among the male students is: 3.45%
The probability of Economics/Finance among the male students is: 13.79%
The probability of International Business among the male students is: 6.90%
The probability of Management among the male students is: 20.69%
The probability of Other among the male students is: 13.79%
The probability of Retailing/Marketing among the male students is: 17.24%
The probability of Undecided among the male students is: 10.34%
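
The percentages above can be reproduced from the male counts quoted in this section. The following is a minimal sketch using exact fractions; the variable names are illustrative and it is not the report's original code.

```python
from fractions import Fraction

# Male counts per major, as quoted in this section (they sum to 29).
male_major_counts = {
    "Accounting": 4,
    "CIS": 1,
    "Economics/Finance": 4,
    "International Business": 2,
    "Management": 6,
    "Other": 4,
    "Retailing/Marketing": 5,
    "Undecided": 3,
}

total_males = sum(male_major_counts.values())

# P(Major | Male) = count of males in that major / total number of males.
conditional = {
    major: Fraction(count, total_males)
    for major, count in male_major_counts.items()
}

for major, p in conditional.items():
    print(f"P({major} | Male) = {p} = {float(p):.2%}")
```

Because the conditioning event is "Male", the denominator is 29 rather than the full sample of 62.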

2.3.2 Find the conditional probability of different majors among the female
students of CMSU.
From the calculations done in Python we conclude that:

The probability of Accounting among the female students is: 9.09%
The probability of CIS among the female students is: 9.09%
The probability of Economics/Finance among the female students is: 21.21%
The probability of International Business among the female students is: 12.12%
The probability of Management among the female students is: 12.12%
The probability of Other among the female students is: 9.09%
The probability of Retailing/Marketing among the female students is: 27.27%
The probability of Undecided among the female students is: 0.00%

2.4. Assume that the sample is representative of the population of CMSU. Based on the data,
answer the following question:
2.4.1. Find the probability that a randomly chosen student is a male and intends to graduate.

Probability that a randomly chosen student is a Male = 29/62

Probability that a Male student intends to graduate = 17/29

Probability that a randomly chosen student is a male and intends to graduate
= Probability that a randomly chosen student is a Male * Probability that a Male student
intends to graduate
= (29/62) * (17/29) = 17/62

Hence from the above calculation done in Python we conclude that:

The probability that a randomly chosen student is a male and intends to graduate is 27.42%

2.4.2 Find the probability that a randomly selected student is a female and does NOT have a
laptop.

Contingency table for Gender and Computer:

Probability that a randomly chosen student is a Female = 33/62

Probability that a Female student has no laptop = 1 - (29/33) = 4/33

Probability that a randomly selected student is a female and does NOT have a laptop
= Probability that a randomly chosen student is a Female * Probability that a Female
student has no laptop
= (33/62) * (4/33) = 4/62

Hence from the calculations done in Python we conclude that:

The probability that a randomly selected student is a female and does NOT have a
laptop is 6.45%
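
The multiplication rule used here, P(A and B) = P(A) * P(B | A), can be checked with exact fractions. This is a small sketch built from the counts quoted above (33 females out of 62 students, 29 of them with a laptop); the variable names are illustrative.

```python
from fractions import Fraction

total_students = 62
females = 33
females_with_laptop = 29

# P(Female)
p_female = Fraction(females, total_students)

# P(No laptop | Female) = 1 - P(Laptop | Female)
p_no_laptop_given_female = 1 - Fraction(females_with_laptop, females)

# Multiplication rule: P(Female and No laptop) = P(Female) * P(No laptop | Female)
p_female_and_no_laptop = p_female * p_no_laptop_given_female

print(f"{p_female_and_no_laptop} = {float(p_female_and_no_laptop):.2%}")
```

The result equals 4/62, which is the count of females without a laptop divided by the total number of students.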

2.5. Assume that the sample is representative of the population of CMSU. Based on the data,
answer the following question:

2.5.1. Find the probability that a randomly chosen student is a male or has full-time employment?

Probability of a student being Male = 29/62
Probability of a student having full-time employment = 10/62
Probability of a student being Male and having full-time employment = 7/62

Probability that a randomly chosen student is either a male or has full-time employment
= Probability of a student being Male + Probability of a student having full-time employment
- Probability of a student being Male and having full-time employment
= 29/62 + 10/62 - 7/62 = 32/62

Hence we conclude that:

The probability that a randomly chosen student is either a male or has full-time employment
is 51.61%
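
Reading the figures above as counts (29 males, 10 students with full-time employment and 7 males with full-time employment, out of 62 students), the addition rule P(A or B) = P(A) + P(B) - P(A and B) can be sketched as follows. All three probabilities share the denominator 62; the variable names are illustrative.

```python
from fractions import Fraction

total_students = 62
males = 29
full_time = 10
male_and_full_time = 7  # males who also have full-time employment

p_male = Fraction(males, total_students)
p_full_time = Fraction(full_time, total_students)
p_male_and_full_time = Fraction(male_and_full_time, total_students)

# Addition rule: subtract the overlap so it is not counted twice.
p_male_or_full_time = p_male + p_full_time - p_male_and_full_time

print(f"{p_male_or_full_time} = {float(p_male_or_full_time):.2%}")
```

Equivalently, 29 + 10 - 7 = 32 students are male or employed full-time, giving 32/62.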
2.5.2. Find the conditional probability that given a female student is randomly
chosen, she is majoring in international business or management.

Probability of international business given Female = 4/33
Probability of management given Female = 4/33

Since a student has only one major, international business and management are mutually
exclusive, so their probabilities can be added.

Probability of international business or management given Female
= Probability of international business given Female + Probability of management given
Female
= 4/33 + 4/33 = 8/33

Hence from the calculations done in Python we conclude that:

The conditional probability that a randomly chosen female student is majoring
in international business or management is 24.24%

2.6. Construct a contingency table of Gender and Intent to Graduate at 2 levels
(Yes/No). The Undecided students are not considered now and the table is a 2x2
table. Do you think the graduate intention and being female are independent
events?
Two events A and B are independent when they satisfy the condition:
P(A ∩ B) = P(A) * P(B)

In this case, whether being female and graduate intention are independent can be checked
with the condition:
P(F ∩ Yes) = P(F) * P(Yes)
where F = Female
Yes = Grad Intention being Yes

Hence from the calculations done in Python we conclude that:

P(F ∩ Yes) ≠ P(F) * P(Yes)

Hence, graduate intention and being female are not independent events.
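
The independence condition can be checked mechanically from a 2x2 table of counts. The sketch below is only an illustration: the helper `check_independence` and both tables are hypothetical, because the report's own 2x2 counts are not reproduced in the text.

```python
from fractions import Fraction


def check_independence(table, row, col):
    """Compare P(row and col) with P(row) * P(col) for a table of counts.

    `table` maps row label -> {column label -> count}.
    Returns (is_independent, joint probability, product of marginals).
    """
    total = sum(sum(cols.values()) for cols in table.values())
    p_row = Fraction(sum(table[row].values()), total)
    p_col = Fraction(sum(cols[col] for cols in table.values()), total)
    p_joint = Fraction(table[row][col], total)
    return p_joint == p_row * p_col, p_joint, p_row * p_col


# Hypothetical counts, NOT the report's contingency table.
dependent_table = {
    "Female": {"Yes": 10, "No": 10},
    "Male": {"Yes": 15, "No": 5},
}
independent_table = {
    "Female": {"Yes": 6, "No": 4},
    "Male": {"Yes": 12, "No": 8},
}

print(check_independence(dependent_table, "Female", "Yes"))
print(check_independence(independent_table, "Female", "Yes"))
```

Exact equality is a strict criterion for sample data, so a formal test of association (for example a chi-square test) is often used alongside it.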

2.7. Note that there are four numerical (continuous) variables in the data set, GPA,
Salary, Spending, and Text Messages.

2.7.1. If a student is chosen randomly, what is the probability that his/her GPA is
less than 3?

The probability that a student's GPA is less than 3 was calculated using the Poisson
distribution. To calculate the probability of a GPA of 3 or less, we add the probabilities
of GPA values 0, 1, 2 and 3 obtained from the Poisson distribution. Because GPA is a
continuous variable and the Poisson distribution describes discrete counts, this figure is
only a rough approximation.

Hence from the calculations done in Python we conclude that:

If a student is chosen randomly, the probability that his/her GPA is less than 3
is 39.49%

2.7.2. Find the conditional probability that a randomly selected male earns 50 or
more. Find the conditional probability that a randomly selected female earns 50 or
more.

The above distplot represents the salary of all the male students in the sample.

As we can see, it is approximately normally distributed, so the conditional probability
that a randomly selected male earns 50 or more can be calculated using the Normal distribution.

To calculate this, we compute the cumulative probability of earning less than 50 using the
Normal distribution and subtract it from 1.

Hence from the calculations done in Python we conclude that:

The conditional probability that a randomly selected male earns 50 or more is 83.04%

Find the conditional probability that a randomly selected female earns 50 or more.

The above distplot represents the salary of all the female students in the sample.
As we can see, it is approximately normally distributed, so the conditional probability
that a randomly selected female earns 50 or more can be calculated using the Normal distribution.

To calculate this, we compute the cumulative probability of earning less than 50 using the
Normal distribution and subtract it from 1.

Hence from the calculations done in Python we conclude that:

The conditional probability that a randomly selected female earns 50 or more is 86.09%
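
The "one minus the cumulative probability" step can be sketched with the standard library's `statistics.NormalDist`. The mean and standard deviation below are placeholders chosen for illustration; the report does not quote the salary parameters it used, so the printed value is not expected to match the percentages above.

```python
from statistics import NormalDist


def prob_at_least(threshold, mean, std_dev):
    """P(X >= threshold) for X ~ Normal(mean, std_dev), computed as 1 - CDF."""
    return 1 - NormalDist(mu=mean, sigma=std_dev).cdf(threshold)


# Placeholder parameters, NOT the report's salary statistics.
example = prob_at_least(50, mean=60, std_dev=10)
print(f"{example:.2%}")
```

In practice the mean and standard deviation would be estimated separately from the male and female salary columns before calling the function.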

2.8. Note that there are four numerical (continuous) variables in the data set, GPA,
Salary, Spending, and Text Messages. For each of them, comment whether they
follow a normal distribution. Write a note summarizing your conclusions.

Fig: Histograms of Salary, GPA, Spending and Text Messages

From the above histograms for the continuous variables GPA, Salary, Spending and Text
Messages we can see that:

● GPA is almost normally distributed, with a slight skew to the left.
● Salary is also approximately normally distributed, with a slight skew to the right.
● Spending is not normally distributed and is highly right-skewed.
● Text Messages is not normally distributed and is highly right-skewed.

Problem 3

Asphalt Shingles Analysis

Introduction

An important quality characteristic used by the manufacturers of ABC asphalt shingles is the
amount of moisture the shingles contain when they are packaged. Customers may feel that
they have purchased a product lacking in quality if they find moisture and wet shingles
inside the packaging. In some cases, excessive moisture can cause the granules attached
to the shingles for texture and coloring purposes to fall off the shingles, resulting in
appearance problems.

Executive Summary:
To monitor the amount of moisture present, the company conducts moisture tests. A
shingle is weighed and then dried. The shingle is then reweighed and, based on the amount
of moisture taken out of the product, the pounds of moisture per 100 square feet are
calculated. The company would like to show that the mean moisture content is less than
0.35 pounds per 100 square feet.

The file (A & B shingles.csv) includes 36 measurements (in pounds per 100 square feet) for
A shingles and 31 for B shingles.

3.1 Do you think there is evidence that the mean moisture contents in both types of
shingles are within the permissible limits? State your conclusions clearly
showing all steps.

Solution:

As per the problem statement there are two types of shingles i.e. A Shingles & B Shingles so
let us take each shingle separately for this analysis.

For A Shingles

Assumption:
Null Hypothesis (H0) = Mean moisture content in A Shingles is not less than the permissible limit
Alternate Hypothesis (H1) = Mean moisture content in A Shingles is less than the permissible limit
A one-sample t-test is used to determine whether an unknown population mean differs from a
specific value. Here the test is one-tailed, and we use Python to calculate the t-value and p-value.

Conclusion:

Since the p-value is 0.0748, which is greater than 0.05, we cannot reject the null hypothesis.
Hence, we conclude that there is not enough evidence that the mean moisture content of
A shingles is less than the permissible limit of 0.35 pounds per 100 square feet. If the
population mean moisture content is in fact no less than 0.35 pounds per 100 square feet,
the probability of observing a sample of 36 A shingles with a sample mean moisture
content of 0.3167 pounds per 100 square feet or less is 0.0748.

For B Shingles

Assumption:
Null Hypothesis (H0) = Mean moisture content in B Shingles is not less than the permissible limit
Alternate Hypothesis (H1) = Mean moisture content in B Shingles is less than the permissible limit
A one-sample t-test is used to determine whether an unknown population mean differs from a
specific value. Here the test is one-tailed, and we use Python to calculate the t-value and p-value.
Conclusion:

Since the p-value is 0.0021, which is less than 0.05, we can reject the null hypothesis.
Hence, we conclude that there is enough evidence that the mean moisture content of B shingles
is less than the permissible limit of 0.35 pounds per 100 square feet. If the population
mean moisture content is in fact no less than 0.35 pounds per 100 square feet, the probability
of observing a sample of 31 B shingles with a sample mean moisture content of
0.2735 pounds per 100 square feet or less is 0.0021.
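
The test statistic behind both conclusions is t = (sample mean - 0.35) / (s / sqrt(n)) with n - 1 degrees of freedom. The sketch below computes it with the standard library. The helper `one_sample_t` and the five measurements are hypothetical, since the values in the shingles file are not listed in the report.

```python
import math
import statistics


def one_sample_t(sample, hypothesized_mean):
    """Return (t statistic, degrees of freedom) for a one-sample t-test."""
    n = len(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    t_stat = (statistics.mean(sample) - hypothesized_mean) / standard_error
    return t_stat, n - 1


# Hypothetical moisture measurements (pounds per 100 square feet), NOT the report's data.
moisture = [0.30, 0.32, 0.28, 0.35, 0.25]

t_stat, dof = one_sample_t(moisture, 0.35)
print(f"t = {t_stat:.3f}, degrees of freedom = {dof}")
```

A negative t statistic supports the "less than 0.35" alternative. The one-tailed p-value is the area to the left of t under the t distribution with n - 1 degrees of freedom, which a statistics library such as SciPy can supply.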

3.2 Do you think that the population mean for shingles A and B are
equal? Form the hypothesis and conduct the test of the hypothesis.
What assumption do you need to check before the test for equality
of means is performed?

To test the equality of the population means of the A shingles and B shingles, the
null and alternative hypotheses for the population mean moisture content are:

H0 : mean moisture content of A = mean moisture content of B
HA : mean moisture content of A ≠ mean moisture content of B
Level of significance: 0.05

We have two samples, A and B, and we do not know the population standard deviations.
The samples are not large, so we use the t distribution and the tSTAT test statistic.
Since we are testing for equality between the means of A and B, we use a two-sample t-test.

Hence from the calculations done in Python we conclude that:

Two-sample t-test p-value = 0.2017

We do not have enough evidence to reject the null hypothesis in favour of the alternative
hypothesis, since the p-value > level of significance.

Therefore, there is no evidence that the population means for shingles A and B differ.
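
A pooled two-sample t statistic can be sketched as below. It assumes the two populations are roughly normal and have equal variances, which is the assumption to check before using this form of the test; the report does not state which variant it ran. The helper name and both samples are hypothetical.

```python
import math
import statistics


def pooled_two_sample_t(sample_a, sample_b):
    """Return (t statistic, degrees of freedom), assuming equal population variances."""
    n_a, n_b = len(sample_a), len(sample_b)
    var_a = statistics.variance(sample_a)
    var_b = statistics.variance(sample_b)
    # Pooled variance: each sample variance weighted by its degrees of freedom.
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t_stat = (statistics.mean(sample_a) - statistics.mean(sample_b)) / standard_error
    return t_stat, n_a + n_b - 2


# Hypothetical moisture measurements, NOT the report's data.
shingles_a = [0.30, 0.32, 0.28, 0.35, 0.25]
shingles_b = [0.27, 0.29, 0.25, 0.31, 0.23]

t_stat_ab, dof_ab = pooled_two_sample_t(shingles_a, shingles_b)
print(f"t = {t_stat_ab:.3f}, degrees of freedom = {dof_ab}")
```

If the sample variances differ markedly, Welch's unequal-variance version of the test is the usual alternative.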
