SMDM Project Report

SMDM PROJECT

REPORT

Contents
SL. No  Heading
1  Problem 1- Wholesale Customers Analysis
a) Problem 1.1
b) Problem 1.2
c) Problem 1.3
d) Problem 1.4
e) Problem 1.5

2  Problem 2- Clear Mountain State University Survey
a) Problem 2.1
b) Problem 2.1.1
c) Problem 2.1.2
d) Problem 2.1.3
e) Problem 2.1.4
f) Problem 2.2
g) Problem 2.2.1
h) Problem 2.2.2
i) Problem 2.3
j) Problem 2.3.1
k) Problem 2.3.2
l) Problem 2.4
m) Problem 2.4.1
n) Problem 2.4.2
o) Problem 2.5
p) Problem 2.5.1
q) Problem 2.5.2
s) Problem 2.6
t) Problem 2.7
u) Problem 2.7.1
v) Problem 2.7.2
w) Problem 2.8

3  Problem 3- A & B shingles testing
a) Problem 3.1
b) Problem 3.2

Problem 1

Wholesale Customers Analysis

Problem Statement:

A wholesale distributor operating in different regions of Portugal has information on annual spending
of several items in their stores across different regions and channels. The data consists of 440 large
retailers’ annual spending on 6 different varieties of products in 3 different regions (Lisbon, Oporto,
Other) and across different sales channel (Hotel, Retail).

1.1 Use methods of descriptive statistics to summarize data. Which Region and which
Channel spent the most? Which Region and which Channel spent the least?

Solution- As a first step, we imported all the necessary libraries and read the CSV file (Wholesale
customer data) into Python for further data analysis.

Exploratory data analysis: - EDA is an approach to analyzing data sets to summarize their main
characteristics, often using statistical graphics and other data visualization methods.

• The describe function was used to get summary statistics.

• To detect missing values and to get a concise summary of the data frame, the isnull and info
functions were used.

• Seaborn pairplot was used to plot pairwise relationships in the dataset.

• A histogram was plotted for each item using matplotlib.
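The non-graphical EDA steps listed above can be sketched as follows. This is a minimal illustration on a hypothetical three-row frame, not the actual wholesale file; the column names and values are assumptions.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the wholesale CSV.
df = pd.DataFrame({
    "Fresh": [100.0, 300.0, 50.0],
    "Milk": [20.0, None, 60.0],
})

summary = df.describe()      # count, mean, std, min, quartiles, max per column
missing = df.isnull().sum()  # number of missing values per column

print(summary)
print(missing)
df.info()                    # dtypes and non-null counts
```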

1.1 Which Region and which Channel spent the most? Which Region and which
Channel spent the least?

Solution- To get the spending across Region and Channel, we created a new column,
‘Spending’, by adding all six items (Fresh, Grocery, Detergents Paper, Delicatessen,
Frozen and Milk). We then used the groupby function to get the maximum and minimum
spending across each channel and region.
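A minimal sketch of this step, assuming the six items are stored in columns with the names used below and that spending is summed within each group. The four rows are hypothetical and only mirror the shape of the real 440-row file.

```python
import pandas as pd

# Hypothetical miniature of the wholesale data (assumed column names).
df = pd.DataFrame({
    "Channel": ["Hotel", "Hotel", "Retail", "Retail"],
    "Region": ["Oporto", "Other", "Lisbon", "Other"],
    "Fresh": [100, 300, 50, 80],
    "Milk": [20, 40, 60, 90],
    "Grocery": [30, 50, 70, 110],
    "Frozen": [10, 25, 5, 15],
    "Detergents_Paper": [5, 10, 40, 60],
    "Delicatessen": [8, 12, 6, 9],
})

items = ["Fresh", "Milk", "Grocery", "Frozen", "Detergents_Paper", "Delicatessen"]

# Total annual spending per customer across the six items.
df["Spending"] = df[items].sum(axis=1)

# Total spending per channel and per region.
by_channel = df.groupby("Channel")["Spending"].sum()
by_region = df.groupby("Region")["Spending"].sum()

print("Channel max/min:", by_channel.idxmax(), by_channel.idxmin())
print("Region max/min:", by_region.idxmax(), by_region.idxmin())
```

idxmax and idxmin return the group label rather than the amount, which answers "which Region and which Channel" directly.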

Below is the Output of Channel-

From the above output we can conclude that Hotel Spending is highest and Retail spending is
the lowest.

Below is the Output across Region-

From the above output we can conclude that in Region ‘Other’ Spending is highest and in
‘Oporto’ spending is the lowest.

Countplot has been used to get graphical representation of data across Region and Channel.

1.2 There are 6 different varieties of items that are considered. Describe and
comment/explain all the varieties across Region and Channel? Provide a detailed
justification for your answer.

Solution- To get 6 different varieties across each Region and Channel we have used
catplot.

In Figures 1 & 1.1 we can see that spending on ‘Fresh’ is higher in the ‘Hotel’ channel
than in Retail, and that the ‘Other’ region has the highest spending of all regions.

In Figures 2 & 2.1 we can see that spending on ‘Milk’ is higher in the ‘Retail’ channel
than in Hotel, and that the ‘Other’ region has the highest spending of all regions.

In Figures 3 & 3.1 we can see that spending on ‘Grocery’ is higher in the ‘Retail’ channel
than in Hotel, and that the ‘Oporto’ region has the highest spending of all regions.

In Figures 4 & 4.1 we can see that spending on ‘Frozen’ is higher in the ‘Hotel’ channel
than in Retail, and that the ‘Oporto’ region has the highest spending of all regions.

In Figures 5 & 5.1 we can see that spending on ‘Detergents Paper’ is higher in the ‘Retail’
channel than in Hotel, and that the ‘Oporto’ region has the highest spending of all regions.

In Figures 6 & 6.1 we can see that spending on ‘Delicatessen’ is higher in the ‘Retail’
channel than in Hotel, and that the ‘Other’ region has the highest spending of all regions.

1.3 On the basis of the descriptive measure of variability, which item shows the most
inconsistent behaviour? Which items shows the least inconsistent behaviour?

Solution- Comparing the standard deviation of each item in the output below, ‘Fresh’ shows the most
inconsistent behaviour, with the highest standard deviation (12647.33), and ‘Delicatessen’ shows the
least inconsistent behaviour, with a standard deviation of 2820.11.

Coefficient of Variation (CV) also helps to determine the measure of variability for each item.

CV for Fresh is 1.0527196084948245
CV for Milk is 1.2718508307424503
CV for Grocery is 1.193815447749267
CV for Frozen is 1.5785355298607762
CV for Detergents Paper is 1.6527657881041729
CV for Delicatessen is 1.8473041039189306

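From the above data we can see that ‘Delicatessen’ has the highest CV (1.85) and ‘Fresh’ the
lowest (1.05). Because the CV divides the standard deviation by the mean, it measures variability
relative to each item’s typical spending level. On that relative scale the ranking reverses:
‘Delicatessen’ is the most inconsistent item and ‘Fresh’ the least, even though ‘Fresh’ has the
largest standard deviation.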
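The difference between the two measures can be sketched with the standard library. The figures below are hypothetical, and using the population standard deviation is an assumption; the sample standard deviation would shift the values only slightly.

```python
import statistics

def coefficient_of_variation(values):
    """Return the population standard deviation divided by the mean."""
    return statistics.pstdev(values) / statistics.mean(values)

# Hypothetical spending figures, not taken from the wholesale data.
small_scale = [10, 20, 30]         # small absolute spread, low mean
large_scale = [1000, 1100, 1200]   # large absolute spread, high mean

for name, values in [("small_scale", small_scale), ("large_scale", large_scale)]:
    sd = statistics.pstdev(values)
    cv = coefficient_of_variation(values)
    print(f"{name}: std = {sd:.2f}, CV = {cv:.3f}")
```

Here the series with the larger standard deviation has the smaller CV, which is the same kind of reversal seen between ‘Fresh’ and ‘Delicatessen’.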

1.4 Are there any outliers in the data? Back up your answer with a suitable plot/technique with
the help of detailed comments.

From the above boxplot, it is evident that every item has outliers.
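A boxplot conventionally draws as outliers the points lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. Assuming that convention, the same check can be sketched numerically; the data and the helper name iqr_outliers are hypothetical.

```python
import statistics

def iqr_outliers(values):
    """Return the values lying beyond 1.5 * IQR from the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical spending values with one unusually large customer.
spending = [12, 15, 14, 13, 16, 15, 14, 90]
print(iqr_outliers(spending))
```

Counting the flagged values per item gives a number to quote next to the plot.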

1.5 On the basis of your analysis, what are your recommendations for the business? How can
your analysis help the business to solve its problem? Answer from the business perspective.

Solution- From the above analysis we can see that spending in the Hotel channel is high compared to
Retail, whereas the two should be comparable. Spending in the Lisbon region is also low, so the
business needs to focus on that region as well. Spending is inconsistent for each item. The items are
weakly or negatively correlated with one another, as the correlation heatmap shows.

Correlation Matrix

Problem 2

The Student News Service at Clear Mountain State University (CMSU) has decided to gather
data about the undergraduate students that attend CMSU. CMSU creates and distributes a
survey of 14 questions and receives responses from 62 undergraduates.

First, we imported all the necessary libraries and read the Survey 1 data set into Python to
analyse the data.

2.1. For this data, construct the following contingency tables (Keep Gender as row variable)

2.1.1. Gender and Major

Solution- Using the crosstab function we constructed a contingency table showing the joint counts
of Gender and Major.

2.1.2. Gender and Grad Intention

Contingency table showing the joint counts of Gender and Grad Intention.

2.1.3. Gender and Employment

Contingency table showing the joint counts of Gender and Employment.

2.1.4. Gender and Computer

Contingency table showing the joint counts of Gender and Computer.
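The four tables above are built the same way. A minimal sketch with pandas crosstab, using a hypothetical six-row stand-in for the 62-row survey (the column names are assumptions):

```python
import pandas as pd

# Hypothetical six-row stand-in for the survey data.
survey = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Male", "Female", "Male"],
    "Major": ["Accounting", "CIS", "Management", "Management", "Accounting", "Other"],
})

# Gender as the row variable, Major as the column variable.
counts = pd.crosstab(survey["Gender"], survey["Major"])

# Row-normalised version: each row sums to 1, giving P(Major | Gender).
row_share = pd.crosstab(survey["Gender"], survey["Major"], normalize="index")

print(counts)
print(row_share.round(3))
```

Swapping "Major" for another column gives the remaining tables with Gender still as the row variable.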

2.2. Assume that the sample is representative of the population of CMSU. Based on the data,
answer the following question:

2.2.1. What is the probability that a randomly selected CMSU student will be male?

Solution- To find the probability that a randomly selected CMSU student is male, we count the male
students and divide by the total number of students. The probability that a randomly selected CMSU
student is male is 46.8%.

2.2.2. What is the probability that a randomly selected CMSU student will be female?

Solution- To find the probability that a randomly selected CMSU student is female, we count the female
students and divide by the total number of students. The probability that a randomly selected CMSU
student is female is 53.2%.

2.3. Assume that the sample is representative of the population of CMSU. Based on the data,
answer the following question:

2.3.1. Find the conditional probability of different majors among the male students in CMSU.

Solution- With the help of the contingency table we can find the conditional probability of each major
among the male students in CMSU.

Probability of a male student opting for Accounting is 13.8%
Probability of a male student who is undecided is 10.3%
Probability of a male student opting for CIS is 3.5%
Probability of a male student opting for Economics/Finance is 13.8%
Probability of a male student opting for Other is 13.8%
Probability of a male student opting for International Business is 6.9%
Probability of a male student opting for Management is 20.7%
Probability of a male student opting for Retailing or Marketing is 17.2%

2.3.2 Find the conditional probability of different majors among the female students of CMSU.

Solution- With the help of the contingency table we can find the conditional probability of each major
among the female students in CMSU.

Probability of a female student opting for Accounting is 9.1%
Probability of a female student opting for Other is 9.1%
Probability of a female student opting for CIS is 9.1%
Probability of a female student opting for Retailing or Marketing is 27.3%
Probability of a female student opting for Management is 12.1%
Probability of a female student opting for Economics or Finance is 21.2%
Probability of a female student who is undecided is 0.0%
Probability of a female student opting for International Business is 12.1%
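Each conditional probability is a cell count divided by its row total. The sketch below illustrates this for the male row; the counts are not quoted from the contingency table but inferred here from the reported percentages and 29 male students (46.8% of 62), so they should be read as an assumption.

```python
# Counts inferred from the reported percentages (assumption, not table output).
male_major_counts = {
    "Accounting": 4,
    "CIS": 1,
    "Economics/Finance": 4,
    "International Business": 2,
    "Management": 6,
    "Other": 4,
    "Retailing/Marketing": 5,
    "Undecided": 3,
}

total_males = sum(male_major_counts.values())

# P(Major | Male) = count of males in that major / total males.
conditional = {major: n / total_males for major, n in male_major_counts.items()}

for major, p in conditional.items():
    print(f"P({major} | Male) = {p:.1%}")
```

The probabilities in a row must sum to 1, which is a quick check that no category was missed.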

2.4. Assume that the sample is a representative of the population of CMSU. Based on the data,
answer the following question:

2.4.1. Find the probability That a randomly chosen student is a male and intends to graduate.

Solution- Using the contingency table of Gender and Grad Intention, we found the number of males who
intend to graduate and divided it by the total number of students, since the question asks for a
joint probability. The probability that a randomly chosen student is male and intends to graduate is 27.4%.

2.4.2 Find the probability that a randomly selected student is a female and does NOT have a
laptop.

Solution- Using the contingency table of Gender and Computer, we found the number of females who do
not have a laptop and divided it by the total number of students, since the question asks for a joint
probability. The probability that a randomly selected student is female and does not have a laptop is 6.5%.

2.5. Assume that the sample is representative of the population of CMSU. Based on the data,
answer the following question:

2.5.1. Find the probability that a randomly chosen student is either a male or has full-time
employment?

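Solution- Using the contingency table of Gender and Employment we got the total number of males and
the number of males who are employed full-time. Dividing the second by the first gives 24.1%, which is
the conditional probability that a male student has full-time employment. The probability that a
randomly chosen student is either male or has full-time employment is instead
P(Male) + P(Full-time) - P(Male and Full-time), with each term taken from the same table as a share
of all 62 students.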
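An "either A or B" probability follows the inclusion-exclusion rule, P(A or B) = P(A) + P(B) - P(A and B). A minimal sketch from raw counts; the helper name prob_union and the counts are hypothetical, chosen only to illustrate the rule.

```python
from fractions import Fraction

def prob_union(n_a, n_b, n_both, n_total):
    """P(A or B) by inclusion-exclusion, computed from raw counts."""
    return Fraction(n_a + n_b - n_both, n_total)

# Hypothetical counts: 12 in A, 5 in B, 3 in both, 20 observations in total.
p = prob_union(n_a=12, n_b=5, n_both=3, n_total=20)
print(p, float(p))
```

Subtracting the overlap matters because members of both groups would otherwise be counted twice.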

2.5.2. Find the conditional probability that given a female student is randomly chosen, she is
majoring in international business or management.

Solution- Using the contingency table of Gender and Major we got the total number of females and the
number of females majoring in international business or management. The probability that a female
student is majoring in international business or management is 24.2%.

2.6. Construct a contingency table of Gender and Intent to Graduate at 2 levels (Yes/No). The
Undecided students are not considered now and the table is a 2x2 table. Do you think the
graduate intention and being female are independent events?

Solution- We constructed a contingency table of Gender and Intent to Graduate at 2 levels
(Yes/No), dropping the Undecided students. From the table below we found that:
Probability that a female student intends to graduate, P(Yes | Female), is 55.0%
Probability that a student intends to graduate, P(Yes), is 70.0%
Probability that a student is female, P(Female), is 50.0%
If the two events were independent, P(Female and Yes) would equal P(Female) x P(Yes) = 35% (28/80)

Since P(Yes | Female) = 55.0% differs from P(Yes) = 70.0%, we conclude that graduate intention and
being female are not independent.
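The independence check can be written out with the three probabilities reported above. Two events are independent when P(A and B) = P(A) x P(B); the variable names below are illustrative.

```python
import math

p_female = 0.50            # reported P(Female)
p_yes = 0.70               # reported P(Yes)
p_yes_given_female = 0.55  # reported P(Yes | Female)

# Joint probability implied by the conditional: P(Female and Yes).
p_joint = p_yes_given_female * p_female

# Joint probability that independence would require.
p_if_independent = p_female * p_yes

independent = math.isclose(p_joint, p_if_independent, abs_tol=1e-9)
print(round(p_joint, 3), round(p_if_independent, 3), independent)
```

The implied joint probability falls short of the product, so the events are not independent in this sample.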

2.7. Note that there are four numerical (continuous) variables in the data set, GPA, Salary,
Spending, and Text Messages.

Answer the following questions based on the data

2.7.1. If a student is chosen randomly, what is the probability that his/her GPA is less than 3?

Solution- Using the contingency table of Gender & GPA, we found the number of students with a GPA
below 3 and the total number of students. The probability that a randomly chosen student's GPA is less than 3 is 27.4%.

2.7.2 Find conditional probability that a randomly selected male earns 50 or more. Find
conditional probability that a randomly selected female earns 50 or more.

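Solution- Using the contingency table of Gender & Salary we found that 29.0% of all 62 students are
females earning 50 or more (18 students) and 22.6% are males earning 50 or more (14 students). These
are joint probabilities. Conditioning on gender, with the 33 female and 29 male students implied by
the proportions in 2.2:

Probability that a randomly selected female earns 50 or more is 18/33 = 54.5%.

Probability that a randomly selected male earns 50 or more is 14/29 = 48.3%.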

2.8. Note that there are four numerical (continuous) variables in the data set, GPA, Salary,
Spending, and Text Messages. For each of them comment whether they follow a normal
distribution. Write a note summarizing your conclusions for this whole Problem 2.

Solution- We have plotted a displot to see whether the numerical (continuous) variables in the data
set, GPA, Salary, Spending, and Text Messages are normally distributed or not.

From the above graphs we can conclude that GPA and Salary are close to a normal distribution, whereas
Text Messages and Spending are somewhat right-skewed and do not follow a normal distribution.

Also, using the describe() method we can check the mean and standard deviation of
GPA, Salary, Text Messages and Spending, shown below.

There are various tests available to check whether data is normally distributed. We chose the
Normal test. If the p-value is very small, it is unlikely that the data came from a normal
distribution; 0.05 is the standard threshold.

Post calculation we can see the below output.

The test also implies that Text Messages and Spending are not normally distributed, whereas GPA and
Salary are close to a normal distribution.
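A minimal sketch of such a check, assuming the "Normal test" refers to scipy.stats.normaltest (the report does not name the function). The two samples are synthetic stand-ins, not the survey columns.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-ins: one bell-shaped sample, one right-skewed sample.
bell_shaped = rng.normal(loc=3.0, scale=0.4, size=200)
right_skewed = rng.exponential(scale=250.0, size=200)

alpha = 0.05
for name, sample in [("bell_shaped", bell_shaped), ("right_skewed", right_skewed)]:
    statistic, p_value = stats.normaltest(sample)
    verdict = "reject normality" if p_value < alpha else "cannot reject normality"
    print(f"{name}: p = {p_value:.4g} -> {verdict}")
```

A p-value above 0.05 does not prove normality; it only means the test found no strong evidence against it.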

Problem 3

An important quality characteristic used by the manufacturers of ABC asphalt shingles is the amount
of moisture the shingles contain when they are packaged. Customers may feel that they have
purchased a product lacking in quality if they find moisture and wet shingles inside the packaging. In
some cases, excessive moisture can cause the granules attached to the shingles for texture and
colouring purposes to fall off the shingles resulting in appearance problems. To monitor the amount of
moisture present, the company conducts moisture tests. A shingle is weighed and then dried. The
shingle is then reweighed, and based on the amount of moisture taken out of the product, the pounds
of moisture per 100 square feet is calculated. The company would like to show that the mean
moisture content is less than 0.35 pound per 100 square feet.

The file (A & B shingles.csv) includes 36 measurements (in pounds per 100 square feet) for A shingles
and 31 for B shingles.

First, we imported all the necessary libraries and read the file in a Jupyter notebook. We then
carried out exploratory data analysis.

Descriptive Statistics-

Check for Null Values-

It is evident that column B has 5 NaN values.

For Graphical representation we have plotted histogram and pairplot-

3.1 Do you think there is evidence that means moisture contents in both types of shingles are
within the permissible limits? State your conclusions clearly showing all steps.

Solution- H0 : mean moisture content >= 0.35, HA : mean moisture content < 0.35

The company wants to show that the mean is below 0.35, so that claim is the alternative hypothesis
and the test is lower-tailed. By using ttest_1samp we calculated the T statistic and P value for
A & B. Below is the output.

The T statistic for A is: -1.4735046253382782

The corresponding pvalue is : 0.07477633144907513

The T statistic for B is: -3.1003313069986995

The corresponding pvalue is : 0.0020904774003191826

Sample A- Since the P value is > alpha (0.05), we do not reject the null hypothesis: we do not have
evidence that the mean moisture content in A is less than 0.35 pound per 100 sq. ft.

Sample B- Since the P value is < alpha (0.05), we reject the null hypothesis: we have evidence that
the mean moisture content in B is less than 0.35 pound per 100 sq. ft.
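A minimal sketch of a lower-tailed one-sample t-test. ttest_1samp returns a two-sided p-value by default, so the sketch converts it by hand; the eight readings are hypothetical, not the shingles data.

```python
import numpy as np
from scipy import stats

# Hypothetical moisture readings in pounds per 100 square feet.
sample = np.array([0.31, 0.29, 0.36, 0.33, 0.30, 0.34, 0.28, 0.32])
limit = 0.35

t_stat, p_two_sided = stats.ttest_1samp(sample, popmean=limit)

# Lower-tailed p-value for HA: mean < limit.
p_lower = p_two_sided / 2 if t_stat < 0 else 1 - p_two_sided / 2

print(f"t = {t_stat:.3f}, lower-tailed p = {p_lower:.4f}")
print("reject H0" if p_lower < 0.05 else "do not reject H0")
```

Halving the two-sided p-value is valid only when the t statistic lies on the side named by the alternative, which is why the sign of t is checked first.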

3.2 Do you think that the population means for shingles A and B are equal? Form the
hypothesis and conduct the test of the hypothesis. What assumption do you need to check
before the test for equality of means is performed?

Solution- H0 : muA = muB , H1 : muA not = muB

By using the two-sample t-test (ttest_ind), we obtained the output below-

The T statistic is: 1.2896282719661123


The corresponding pvalue is : 0.2017496571835306

As the p value is > alpha (0.05), we do not reject the null hypothesis: there is no evidence that
the population means for shingles A and B differ.
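A minimal sketch of the two-sample test, including one way to check the equal-variance assumption beforehand. Levene's test is our choice for that check and is not taken from the analysis above; the readings are hypothetical, with B padded by NaN to mirror a shorter column.

```python
import numpy as np
from scipy import stats

# Hypothetical readings; B has missing values, as a shorter CSV column would.
a = np.array([0.31, 0.29, 0.36, 0.33, 0.30, 0.34])
b = np.array([0.27, 0.30, 0.25, 0.29, np.nan, np.nan])
b = b[~np.isnan(b)]  # drop missing values before testing

# Levene's test: a small p-value suggests the variances differ.
_, p_levene = stats.levene(a, b)
equal_var = bool(p_levene > 0.05)

# Two-sided test of H0: muA = muB (Welch's variant if variances differ).
t_stat, p_value = stats.ttest_ind(a, b, equal_var=equal_var)

print(f"Levene p = {p_levene:.3f}, equal_var = {equal_var}")
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```

The t-test also assumes independent samples and roughly normal data within each group, which the histograms can help judge.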
