SMDM Project Report
Contents
SL. No Heading
1 Problem 1- Wholesale Customers Analysis
a) Problem 1.1
b) Problem 1.2
c) Problem 1.3
d) Problem 1.4
e) Problem 1.5
Problem 1
Problem Statement:
A wholesale distributor operating in different regions of Portugal has information on annual spending
of several items in their stores across different regions and channels. The data consists of 440 large
retailers’ annual spending on 6 different varieties of products in 3 different regions (Lisbon, Oporto,
Other) and across different sales channel (Hotel, Retail).
1.1 Use methods of descriptive statistics to summarize data. Which Region and which
Channel spent the most? Which Region and which Channel spent the least?
Solution- As a first step, we imported all the necessary libraries and read the CSV file (Wholesale
customer data) into Python for further analysis.
Exploratory data analysis: EDA is an approach to analysing data sets to summarize their main
characteristics, often using statistical graphics and other data visualization methods.
To detect missing values and to get a concise summary of the data frame, the isnull and info
functions were used.
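A minimal sketch of the missing-value check described above. The rows below are hypothetical stand-ins (the report reads the real data from the CSV file), and one value is left empty on purpose so that the check has something to find.

```python
import pandas as pd

# Hypothetical stand-in rows; the report loads the real data from a CSV file.
df = pd.DataFrame({
    "Channel": ["Hotel", "Retail", "Hotel"],
    "Region": ["Lisbon", "Oporto", "Other"],
    "Fresh": [1200, 800, None],   # None marks a deliberately missing value
    "Milk": [300, 950, 410],
})

# Count missing values per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Concise summary: column names, non-null counts and dtypes.
df.info()
```

A column with a non-zero count in `missing_per_column` would need imputation or removal before further analysis.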
We plotted a histogram for each item using matplotlib.
1.1 Which Region and which Channel spent the most? Which Region and which
Channel spent the least?
Solution- To get the spending across Region and Channel, we created a new column,
‘Spending’, by adding all the items (Fresh, Grocery, Detergents Paper, Delicatessen,
Frozen and Milk). We then used the groupby function to get the maximum and minimum
spending across each Channel and Region.
From the above output we can conclude that Hotel Spending is highest and Retail spending is
the lowest.
From the above output we can conclude that spending is highest in the ‘Other’ region and
lowest in ‘Oporto’.
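The column construction and grouping described above can be sketched as follows. The four rows and the underscore in `Detergents_Paper` are hypothetical, and summing spending within each group is an assumption, since the report does not say whether totals or means were compared.

```python
import pandas as pd

item_cols = ["Fresh", "Milk", "Grocery", "Frozen", "Detergents_Paper", "Delicatessen"]

# Hypothetical rows, not the report's data.
df = pd.DataFrame({
    "Channel": ["Hotel", "Hotel", "Retail", "Retail"],
    "Region": ["Lisbon", "Other", "Oporto", "Other"],
    "Fresh": [100, 200, 50, 60],
    "Milk": [10, 20, 30, 40],
    "Grocery": [10, 10, 40, 50],
    "Frozen": [5, 5, 5, 5],
    "Detergents_Paper": [1, 2, 20, 30],
    "Delicatessen": [4, 3, 5, 5],
})

# Total spending per customer across the six items.
df["Spending"] = df[item_cols].sum(axis=1)

# Total spending per Channel and per Region (assumption: totals, not means).
by_channel = df.groupby("Channel")["Spending"].sum()
by_region = df.groupby("Region")["Spending"].sum()

print("Channel max/min:", by_channel.idxmax(), by_channel.idxmin())
print("Region max/min:", by_region.idxmax(), by_region.idxmin())
```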
A countplot was used to give a graphical representation of the data across Region and Channel.
1.2 There are 6 different varieties of items that are considered. Describe and
comment/explain all the varieties across Region and Channel? Provide a detailed
justification for your answer.
Solution- To compare the 6 varieties across each Region and Channel we used
catplot.
Figures 1 and 1.1 show that spending on ‘Fresh’ is higher in the ‘Hotel’ channel than in
‘Retail’, and that the ‘Other’ region has the highest spending of all regions.
Figures 2 and 2.1 show that spending on ‘Milk’ is higher in the ‘Retail’ channel than in
‘Hotel’, and that the ‘Other’ region has the highest spending of all regions.
Figures 3 and 3.1 show that spending on ‘Grocery’ is higher in the ‘Retail’ channel than in
‘Hotel’, and that the ‘Oporto’ region has the highest spending of all regions.
Figures 4 and 4.1 show that spending on ‘Frozen’ is higher in the ‘Hotel’ channel than in
‘Retail’, and that the ‘Oporto’ region has the highest spending of all regions.
Figures 5 and 5.1 show that spending on ‘Detergents Paper’ is higher in the ‘Retail’ channel
than in ‘Hotel’, and that the ‘Oporto’ region has the highest spending of all regions.
Figures 6 and 6.1 show that spending on ‘Delicatessen’ is higher in the ‘Retail’ channel than
in ‘Hotel’, and that the ‘Other’ region has the highest spending of all regions.
1.3 On the basis of the descriptive measure of variability, which item shows the most
inconsistent behaviour? Which items shows the least inconsistent behaviour?
Solution- Comparing the standard deviation of each item, we can see from the output below that
‘Fresh’ has the highest standard deviation (12647.33) and ‘Delicatessen’ the lowest (2820.11). On this
absolute measure, ‘Fresh’ shows the most inconsistent behaviour and ‘Delicatessen’ the least.
The Coefficient of Variation (CV), the standard deviation divided by the mean, measures variability
relative to each item's average spending.
From the above data we can see that ‘Delicatessen’ has the highest CV (1.85) and ‘Fresh’ the
lowest (1.05). The two measures therefore rank the items in opposite order: relative to its mean,
‘Delicatessen’ is the most inconsistent item and ‘Fresh’ the least.
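The two variability measures can be sketched as below. The values are hypothetical and chosen only to show how an item can have the largest standard deviation while another has the largest CV.

```python
import pandas as pd

# Hypothetical spending values, not the report's data.
df = pd.DataFrame({
    "Fresh": [100, 300, 500],
    "Delicatessen": [1, 2, 12],
})

std = df.std()            # sample standard deviation (ddof=1), in spending units
cv = std / df.mean()      # coefficient of variation, unit-free

print(std)
print(cv)
print("Largest std:", std.idxmax(), "| largest CV:", cv.idxmax())
```

Standard deviation depends on the scale of the item, so CV is usually the fairer measure when the items have very different mean spending.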
1.4 Are there any outliers in the data? Back up your answer with a suitable plot/technique with
the help of detailed comments.
From the above boxplot, it is evident that every item has outliers.
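A boxplot marks as outliers the points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. A minimal sketch of that rule on a hypothetical series (not the report's data):

```python
import pandas as pd

# Hypothetical spending values with one obviously extreme point.
s = pd.Series([10, 12, 11, 13, 12, 100], name="Fresh")

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1

# Same whisker rule that a standard boxplot uses.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = s[(s < lower) | (s > upper)]

print(f"bounds: [{lower}, {upper}]")
print(outliers)
```

Applying the same rule to each item column gives a count of outliers per item to back up the plot.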
1.5 On the basis of your analysis, what are your recommendations for the business? How can
your analysis help the business to solve its problem? Answer from the business perspective.
Solution- From the above analysis we can see that spending in the Hotel channel is high compared to
Retail; ideally the two would be more balanced. Spending in the Lisbon region is also low, so the
business needs to focus on that region as well. Spending on each item is inconsistent. There is a
weak or negative correlation between the items, as the correlation heatmap shows.
Correlation Matrix
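A correlation matrix like the one referred to here can be computed with pandas. The values below are hypothetical; the plotting call in the comment is an optional step and assumes seaborn is installed.

```python
import pandas as pd

# Hypothetical spending values, not the report's data.
df = pd.DataFrame({
    "Fresh": [10, 40, 20, 30],
    "Milk": [1, 2, 3, 4],
    "Grocery": [2, 4, 6, 8],
})

# Pairwise Pearson correlation between the item columns.
corr = df.corr()
print(corr.round(2))

# Optional plot (assumes seaborn is available):
# import seaborn as sns; sns.heatmap(corr, annot=True)
```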
Problem 2
The Student News Service at Clear Mountain State University (CMSU) has decided to gather
data about the undergraduate students that attend CMSU. CMSU creates and distributes a
survey of 14 questions and receives responses from 62 undergraduates.
Firstly, we imported all the necessary libraries and then read the (Survey 1) data set into Python to
analyse the data.
2.1. For this data, construct the following contingency tables (Keep Gender as row variable)
Solution- Using the crosstab function we constructed a contingency table showing the joint counts
of Gender and Major.
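A minimal sketch of such a contingency table, with Gender as the row variable. The five survey rows are hypothetical.

```python
import pandas as pd

# Hypothetical survey rows, not the CMSU data.
survey = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male", "Female"],
    "Major": ["CIS", "Management", "CIS", "Accounting", "Management"],
})

# Rows = Gender, columns = Major; margins=True adds row and column totals ("All").
table = pd.crosstab(survey["Gender"], survey["Major"], margins=True)
print(table)
```

The same call with a different second column (for example the computer or employment column) would give the other tables asked for in 2.1.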
2.1.4. Gender and Computer
2.2. Assume that the sample is representative of the population of CMSU. Based on the data,
answer the following question:
2.2.1. What is the probability that a randomly selected CMSU student will be male?
Solution- To find the probability that a randomly selected CMSU student is male, we first count
the male students and then divide by the total number of students. The resulting
probability is 46.8%.
2.2.2. What is the probability that a randomly selected CMSU student will be female?
Solution- To find the probability that a randomly selected CMSU student is female, we first count
the female students and then divide by the total number of students. The resulting
probability is 53.2%.
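The two marginal probabilities can be reproduced with simple arithmetic. The total of 62 respondents comes from the problem statement; the count of 29 males is an assumption inferred from the reported 46.8%, not a figure stated in the report.

```python
total_students = 62      # from the problem statement
male_count = 29          # assumption: inferred from the reported 46.8%
female_count = total_students - male_count

p_male = male_count / total_students
p_female = female_count / total_students

print(f"P(male)   = {p_male:.1%}")
print(f"P(female) = {p_female:.1%}")
```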
2.3. Assume that the sample is representative of the population of CMSU. Based on the data,
answer the following question:
2.3.1. Find the conditional probability of different majors among the male students in CMSU.
Solution- With the help of the contingency table we can find the probability of different majors among
the male students in CMSU.
2.3.2 Find the conditional probability of different majors among the female students of CMSU.
Solution- With the help of the contingency table we can find the probability of different majors among
the female students in CMSU.
Probability of a female student opting for Accounting is 9.1%
Probability of a female student opting for Other is 9.1%
Probability of a female student opting for CIS is 9.1%
Probability of a female student opting for Retailing or Marketing is 27.3%
Probability of a female student opting for Management is 12.1%
Probability of a female student opting for Economics or Finance is 21.2%
Probability of a female student being Undecided is 0.0%
Probability of a female student opting for International Business is 12.1%
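Conditional probabilities of this kind are the row proportions of the Gender by Major table. A minimal sketch on hypothetical rows, using the `normalize="index"` option of `pd.crosstab`:

```python
import pandas as pd

# Hypothetical survey rows, not the CMSU data.
survey = pd.DataFrame({
    "Gender": ["Female"] * 4 + ["Male"] * 2,
    "Major": ["CIS", "Management", "Management", "Other", "CIS", "Other"],
})

# normalize="index" divides each row by its total, giving P(Major | Gender).
cond = pd.crosstab(survey["Gender"], survey["Major"], normalize="index")
print(cond.round(3))
```

Each row of `cond` sums to 1, which is a quick check that the values are conditional on gender rather than joint probabilities.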
2.4. Assume that the sample is a representative of the population of CMSU. Based on the data,
answer the following question:
2.4.1. Find the probability That a randomly chosen student is a male and intends to graduate.
Solution- Using the contingency table of Gender and Grad Intention, we found the number of males who
intend to graduate and divided it by the total number of students, since this is a joint probability
rather than a conditional one. The probability that a student is male and intends to graduate is 27.4%.
2.4.2 Find the probability that a randomly selected student is a female and does NOT have a
laptop.
Solution- Using the contingency table of Gender and Computer, we found the number of females
who do not have a laptop and divided it by the total number of students, since this is a joint
probability. The probability that a student is female and does not have a laptop is 6.5%.
2.5. Assume that the sample is representative of the population of CMSU. Based on the data,
answer the following question:
2.5.1. Find the probability that a randomly chosen student is either a male or has full-time
employment?
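Solution- Using the contingency table of Gender and Employment, we found the total number of males and
the number of males who are employed full-time. Dividing the second by the first gives 24.1%. Note that
this ratio is the conditional probability of full-time employment given male, not the probability asked for.
By the addition rule, P(male or full-time) = P(male) + P(full-time) - P(male and full-time), which cannot
be smaller than P(male) = 46.8%.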
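For an "either/or" question the addition rule applies: P(A or B) = P(A) + P(B) - P(A and B). The sketch below uses hypothetical counts (not the CMSU data) to show how this differs from the conditional probability P(B | A).

```python
# Hypothetical counts, not the CMSU data.
total = 10
male = 4
full_time = 3
male_and_full_time = 1

# Addition rule: subtract the overlap so it is not counted twice.
p_male_or_full_time = (male + full_time - male_and_full_time) / total

# For comparison: conditional probability of full-time given male.
p_full_time_given_male = male_and_full_time / male

print(f"P(male or full-time) = {p_male_or_full_time:.2f}")
print(f"P(full-time | male)  = {p_full_time_given_male:.2f}")
```

The union can never be smaller than P(male) alone, which makes a useful sanity check on the result.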
2.5.2. Find the conditional probability that given a female student is randomly chosen, she is
majoring in international business or management.
Solution- Using the contingency table of Gender and Major, we found the total number of females and the
number of females majoring in international business or management. The conditional probability that a
female student is majoring in international business or management is 24.2%.
2.6. Construct a contingency table of Gender and Intent to Graduate at 2 levels (Yes/No). The
Undecided students are not considered now and the table is a 2x2 table. Do you think the
graduate intention and being female are independent events?
Solution- We constructed a contingency table of Gender and Intent to Graduate at 2 levels
(Yes/No), dropping the Undecided students. From the table below we found that:
Probability that a student intends to graduate, given that the student is female: 55.0%
Probability that a student intends to graduate: 70.0%
Probability that a student is female: 50.0%
Probability of female x probability of intending to graduate = 35% (28/80)
If the two events were independent, the probability of intending to graduate would be the same with or
without conditioning on being female. Since 55.0% differs from 70.0%, we conclude that graduate
intention and being female are not independent.
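The independence check can be sketched with the percentages reported above. Reading the 55.0% figure as the conditional probability P(intends to graduate | female) is an assumption; it cannot be a joint probability, because it exceeds P(female) = 50.0%.

```python
import math

p_female = 0.50            # reported
p_yes = 0.70               # reported: intends to graduate
p_yes_given_female = 0.55  # assumption: the reported 55.0% is conditional on female

# Joint probability implied by the conditional probability.
p_joint = p_yes_given_female * p_female

# Value the joint probability would take under independence.
p_product = p_female * p_yes

independent = math.isclose(p_joint, p_product, abs_tol=1e-9)
print(f"P(F and Yes) = {p_joint:.3f}, P(F) * P(Yes) = {p_product:.3f}")
print("independent" if independent else "not independent")
```

With a sample this small the gap could partly reflect sampling noise, so a chi-square test of independence would be a natural follow-up.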
2.7. Note that there are four numerical (continuous) variables in the data set, GPA, Salary,
Spending, and Text Messages.
2.7.1. If a student is chosen randomly, what is the probability that his/her GPA is less than 3?
Solution- Using the contingency table of Gender and GPA, we counted the students with a GPA below 3
and divided by the total number of students. The probability that a student's GPA is less than 3 is 27.4%.
2.7.2 Find conditional probability that a randomly selected male earns 50 or more. Find
conditional probability that a randomly selected female earns 50 or more.
Solution- Using contingency table of Gender & Salary we found out that-
2.8. Note that there are four numerical (continuous) variables in the data set, GPA, Salary,
Spending, and Text Messages. For each of them comment whether they follow a normal
distribution. Write a note summarizing your conclusions for this whole Problem 2.
Solution- We plotted a displot for each of the numerical (continuous) variables in the data
set (GPA, Salary, Spending and Text Messages) to see whether they are normally distributed.
From the above graphs we can conclude that GPA and Salary are close to a normal distribution, whereas
Text Messages and Spending are somewhat right-skewed and do not follow a normal distribution.
Using the describe() method we can also check summary statistics such as the mean and standard
deviation of GPA, Salary, Text Messages and Spending. Please find the output below.
Various tests are available to check whether data are normally distributed. We chose the
Normal test. If the p-value is very small, it is unlikely that
the data came from a normal distribution; 0.05 is the standard threshold.
The test also indicates that Text Messages and Spending are not normally distributed, whereas GPA and
Salary are close to a normal distribution.
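A minimal sketch of such a normality check, assuming the "Normal test" refers to `scipy.stats.normaltest` (the D'Agostino-Pearson test); the report does not name the function. Both samples are simulated stand-ins, not the survey columns.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated stand-ins for the survey columns (hypothetical values).
bell_shaped = rng.normal(loc=3.0, scale=0.4, size=200)
right_skewed = rng.exponential(scale=200.0, size=200)

alpha = 0.05
for name, sample in [("bell_shaped", bell_shaped), ("right_skewed", right_skewed)]:
    stat, p_value = stats.normaltest(sample)
    verdict = "not normal" if p_value < alpha else "no evidence against normality"
    print(f"{name}: statistic={stat:.2f}, p={p_value:.4f} -> {verdict}")

# Keep the p-value of the skewed sample for inspection.
_, p_skewed = stats.normaltest(right_skewed)
```

A p-value above 0.05 does not prove normality; it only means the test found no evidence against it.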
Problem 3
An important quality characteristic used by the manufacturers of ABC asphalt shingles is the amount
of moisture the shingles contain when they are packaged. Customers may feel that they have
purchased a product lacking in quality if they find moisture and wet shingles inside the packaging. In
some cases, excessive moisture can cause the granules attached to the shingles for texture and
colouring purposes to fall off the shingles resulting in appearance problems. To monitor the amount of
moisture present, the company conducts moisture tests. A shingle is weighed and then dried. The
shingle is then reweighed, and based on the amount of moisture taken out of the product, the pounds
of moisture per 100 square feet is calculated. The company would like to show that the mean
moisture content is less than 0.35 pound per 100 square feet.
The file (A & B shingles.csv) includes 36 measurements (in pounds per 100 square feet) for A shingles
and 31 for B shingles.
Firstly, we imported all the necessary libraries and read the file into a Jupyter notebook. We then carried
out Exploratory Data Analysis.
Descriptive Statistics-
3.1 Do you think there is evidence that means moisture contents in both types of shingles are
within the permissible limits? State your conclusions clearly showing all steps.
Solution- H0: mean moisture content >= 0.35, HA: mean moisture content < 0.35
Using ttest_1samp we calculated the t statistic and p-value for A and B. Below is the output.
Sample A- Since the p-value is > alpha (0.05), we do not reject the null hypothesis: we do not have
evidence that the mean moisture content in A is less than 0.35 pound per 100 sq. ft.
Sample B- Since the p-value is < alpha (0.05), we reject the null hypothesis: we have evidence that
the mean moisture content in B is less than 0.35 pound per 100 sq. ft.
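A minimal sketch of a one-sided, one-sample t-test against the 0.35 limit, taking "mean below 0.35" as the alternative the company wants to show. The readings are hypothetical, and the one-sided p-value is derived from the two-sided result of `ttest_1samp` so that the sketch does not depend on newer SciPy options.

```python
import numpy as np
from scipy import stats

# Hypothetical moisture readings (pounds per 100 sq. ft.), not the report's data.
sample = np.array([0.30, 0.28, 0.33, 0.31, 0.29, 0.32, 0.27, 0.30])
limit = 0.35
alpha = 0.05

# ttest_1samp returns a two-sided p-value by default.
t_stat, p_two_sided = stats.ttest_1samp(sample, popmean=limit)

# Convert to a one-sided p-value for HA: mean < limit.
p_one_sided = p_two_sided / 2 if t_stat < 0 else 1 - p_two_sided / 2
reject_h0 = p_one_sided < alpha

print(f"t = {t_stat:.3f}, one-sided p = {p_one_sided:.5f}, reject H0: {reject_h0}")
```

Rejecting H0 here supports the claim that the mean is below the limit; failing to reject it means the data do not establish that claim.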
3.2 Do you think that the population means for shingles A and B are equal? Form the
hypothesis and conduct the test of the hypothesis. What assumption do you need to check
before the test for equality of means is performed?
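Solution- H0: the mean moisture content of A equals that of B, HA: the two means differ.
As the p-value is > alpha (0.05), we do not reject the null hypothesis; there is no evidence that the
population means for shingles A and B differ.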
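A two-sample t-test assumes independent samples that are roughly normal, and the pooled version also assumes equal variances. The sketch below checks the variance assumption with Levene's test before running `ttest_ind`; the readings are hypothetical, and using Levene's test here is an assumption, since the report does not state which check was performed.

```python
from scipy import stats

# Hypothetical moisture readings, not the report's data.
a = [0.30, 0.32, 0.31, 0.29, 0.33, 0.30]
b = [0.31, 0.30, 0.32, 0.28, 0.29, 0.33]
alpha = 0.05

# Levene's test: H0 is that both groups have equal variances.
_, levene_p = stats.levene(a, b)
equal_var = bool(levene_p > alpha)

# Pooled t-test if variances look equal, Welch's t-test otherwise.
t_stat, p_value = stats.ttest_ind(a, b, equal_var=equal_var)

print(f"Levene p = {levene_p:.3f}, equal_var = {equal_var}")
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
print("means differ" if p_value < alpha else "no evidence that the means differ")
```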
----------------------Thank You---------------------------