SMDM Project Report Dipti
DIPTI PATIL
Problem 1
A wholesale distributor operating in different regions of Portugal has information on the annual spending on several
items in its stores across different regions and channels. The data consist of 440 large retailers’ annual
spending on 6 different varieties of products in 3 different regions (Lisbon, Oporto, Other) and across 2
sales channels (Hotel, Retail).
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 440 entries, 0 to 439
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Buyer/Spender 440 non-null int64
1 Channel 440 non-null object
2 Region 440 non-null object
3 Fresh 440 non-null int64
4 Milk 440 non-null int64
5 Grocery 440 non-null int64
6 Frozen 440 non-null int64
7 DetergentsPaper 440 non-null int64
8 Delicatessen 440 non-null int64
dtypes: int64(7), object(2)
memory usage: 31.1+ KB
1.1 Use methods of descriptive statistics to summarize data. Which Region and which Channel spent the most?
Which Region and which Channel spent the least?
Buyer/Spender Channel Region Fresh Milk Grocery Frozen DetergentsPaper Delicatessen
count 440.00 440 440 440.00 440.00 440.00 440.00 440.00 440.00
top nan Hotel Other nan nan nan nan nan nan
freq nan 298 316 nan nan nan nan nan nan
mean 220.50 NaN NaN 12000.30 5796.27 7951.28 3071.93 2881.49 1524.87
std 127.16 NaN NaN 12647.33 7380.38 9503.16 4854.67 4767.85 2820.11
min 1.00 NaN NaN 3.00 55.00 3.00 25.00 3.00 3.00
25% 110.75 NaN NaN 3127.75 1533.00 2153.00 742.25 256.75 408.25
50% 220.50 NaN NaN 8504.00 3627.00 4755.50 1526.00 816.50 965.50
75% 330.25 NaN NaN 16933.75 7190.25 10655.75 3554.25 3922.00 1820.25
max 440.00 NaN NaN 112151.00 73498.00 92780.00 60869.00 40827.00 47943.00
• The describe output shows that the mean is greater than the median for all six item variables.
Also, Median - Q1 < Q3 - Median for each of them. Both observations indicate that the data are right-skewed.
• There is a large gap between the 75th percentile and the maximum of every item variable, which suggests
that the variables contain outliers. We will verify the outliers further with a boxplot.
• Channel-wise, ‘Hotel’ spends the most annually, with the highest spending on ‘Fresh’ and the least spending on
‘Detergents & Paper’.
• Channel-wise, ‘Retail’ spends the least annually, with the highest spending on ‘Grocery’ and the least spending
on ‘Frozen’.
• Region-wise, ‘Other’ spends the most annually, with the highest spending on ‘Fresh’ and the least spending on
‘Delicatessen’.
• Region-wise, ‘Oporto’ spends the least annually, with the highest spending on ‘Fresh’ and the least spending
on ‘Delicatessen’.
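The channel-wise and region-wise totals behind these bullets could be obtained with a group-by. The following is a minimal sketch, assuming the column names shown in the info() output; the six rows and the two item columns are made-up illustration values, not the distributor's data.

```python
import pandas as pd

# Hypothetical mini-sample for illustration; the real file has 440 rows
# and six item columns.
df = pd.DataFrame({
    "Channel": ["Hotel", "Hotel", "Retail", "Retail", "Hotel", "Retail"],
    "Region":  ["Other", "Lisbon", "Other", "Oporto", "Oporto", "Lisbon"],
    "Fresh":   [12000, 9000, 3000, 2500, 7000, 4000],
    "Grocery": [4000, 3500, 9000, 8000, 3000, 10000],
})
items = ["Fresh", "Grocery"]

# Total annual spending per retailer, then summed within each group.
df["Total"] = df[items].sum(axis=1)
by_channel = df.groupby("Channel")["Total"].sum()
by_region = df.groupby("Region")["Total"].sum()

print("Channel: most =", by_channel.idxmax(), ", least =", by_channel.idxmin())
print("Region:  most =", by_region.idxmax(), ", least =", by_region.idxmin())

# Per-item breakdown within each channel (highest and lowest item per group).
item_by_channel = df.groupby("Channel")[items].sum()
print(item_by_channel)
```

With the full data set, `items` would list all six varieties; `idxmax` and `idxmin` then give the group that spends the most and the least.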
1.2 There are 6 different varieties of items that are considered. Describe and comment/explain all the varieties
across Region and Channel? Provide a detailed justification for your answer.
1.3 On the basis of a descriptive measure of variability, which item shows the most inconsistent behaviour?
Which items show the least inconsistent behaviour?
➢ The boxplot shows that all six varieties have outliers. ‘Fresh’ and ‘Grocery’ have the largest
outliers.
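One descriptive measure that answers question 1.3 is the coefficient of variation (standard deviation divided by mean). Choosing this measure is an assumption here, since the report does not name the one it used. The sketch below computes it from the figures in the describe table and also applies the usual 1.5 × IQR rule to check whether each maximum lies beyond the upper whisker of a boxplot.

```python
import pandas as pd

# Summary statistics copied from the describe() table in section 1.1.
stats = pd.DataFrame(
    {
        "mean":   [12000.30, 5796.27, 7951.28, 3071.93, 2881.49, 1524.87],
        "std":    [12647.33, 7380.38, 9503.16, 4854.67, 4767.85, 2820.11],
        "q1":     [3127.75, 1533.00, 2153.00, 742.25, 256.75, 408.25],
        "median": [8504.00, 3627.00, 4755.50, 1526.00, 816.50, 965.50],
        "q3":     [16933.75, 7190.25, 10655.75, 3554.25, 3922.00, 1820.25],
        "max":    [112151.0, 73498.0, 92780.0, 60869.0, 40827.0, 47943.0],
    },
    index=["Fresh", "Milk", "Grocery", "Frozen", "DetergentsPaper", "Delicatessen"],
)

# Coefficient of variation: spread relative to the mean (unit-free).
stats["cv"] = stats["std"] / stats["mean"]

# Tukey's rule: values above Q3 + 1.5 * IQR are drawn as outliers in a boxplot.
iqr = stats["q3"] - stats["q1"]
stats["upper_fence"] = stats["q3"] + 1.5 * iqr
stats["max_is_outlier"] = stats["max"] > stats["upper_fence"]

print(stats[["cv", "upper_fence", "max_is_outlier"]].round(2))
print("Most inconsistent:", stats["cv"].idxmax())
print("Least inconsistent:", stats["cv"].idxmin())
```

On the reported figures this ranks ‘Delicatessen’ as the most variable item and ‘Fresh’ as the least variable, and every item's maximum lies above its upper fence.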
1.5 On the basis of your analysis, what are your recommendations for the business? How can your analysis help
the business to solve its problem? Answer from the business perspective.
➢ ‘Fresh’ has the highest demand, both channel-wise and region-wise.
➢ ‘Milk’, ‘Detergents & Paper’ and ‘Grocery’ have higher demand in the Retail channel than in the Hotel channel.
➢ The Other region has higher demand for all varieties than Lisbon and Oporto.
➢ To maximize profit, the wholesaler should keep a consistent supply and stock of each variety in line
with the demand pattern above.
Problem 2:
The Student News Service at Clear Mountain State University (CMSU) has decided to gather data about the
undergraduate students that attend CMSU. CMSU creates and distributes a survey of 14 questions and receives
responses from 62 undergraduates.
ID Gender Age Class Major GradIntention GPA Employment Salary SocialNetworking Satisfaction Spending Computer TextMessages
0 1 Female 20 Junior Other Yes 2.9 Full-Time 50.0 1 3 350 Laptop 200
2 3 Male 21 Junior Other Yes 2.5 Part-Time 45.0 2 4 600 Laptop 200
3 4 Male 21 Junior CIS Yes 2.5 Full-Time 40.0 4 6 600 Laptop 250
4 5 Male 23 Senior Other Undecided 2.8 Unemployed 40.0 2 4 500 Laptop 100
Descriptive Analysis:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62 entries, 0 to 61
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 62 non-null int64
1 Gender 62 non-null object
2 Age 62 non-null int64
3 Class 62 non-null object
4 Major 62 non-null object
5 GradIntention 62 non-null object
6 GPA 62 non-null float64
7 Employment 62 non-null object
8 Salary 62 non-null float64
9 SocialNetworking 62 non-null int64
10 Satisfaction 62 non-null int64
11 Spending 62 non-null int64
12 Computer 62 non-null object
13 TextMessages 62 non-null int64
dtypes: float64(2), int64(6), object(6)
memory usage: 6.9+ KB
• The descriptive analysis shows that the average age is 21, the average GPA is 3.13 and the average salary is 48.55.
• Also, the mean is almost equal to the median for all continuous variables. This suggests
that their distributions are roughly symmetric.
2.1. For this data, construct the following contingency tables (Keep Gender as row variable)
2.1.1. Gender and Major
2.2.1. What is the probability that a randomly selected CMSU student will be male?
➢ The data have 62 entries: 29 male students and 33 female students.
➢ The probability that a randomly selected CMSU student is male is 29/62 = 46.77%
2.2.2. What is the probability that a randomly selected CMSU student will be female?
➢ The probability that a randomly selected CMSU student is female is 33/62 = 53.23%
2.3. Assume that the sample is representative of the population of CMSU. Based on the data, answer the
following question:
2.3.1. Find the conditional probability of different majors among the male students in CMSU.
2.3.2. Find the conditional probability of different majors among the Female students in CMSU.
According to the above data, ‘Retailing/Marketing’ is the most common major among female students and
‘Management’ is the most common major among male students.
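Conditional probabilities of this kind (major given gender) are row-normalised contingency tables. The sketch below shows one way to build them with pandas; the eight rows are hypothetical stand-ins for the 62 survey responses, and only the column names `Gender` and `Major` are taken from the report.

```python
import pandas as pd

# Hypothetical responses for illustration; the survey itself has 62 rows.
df = pd.DataFrame({
    "Gender": ["Female", "Female", "Female", "Female",
               "Male", "Male", "Male", "Male"],
    "Major":  ["Retailing/Marketing", "Retailing/Marketing", "Management", "Other",
               "Management", "Management", "CIS", "Other"],
})

# Raw counts with Gender as the row variable.
counts = pd.crosstab(df["Gender"], df["Major"])

# normalize="index" divides each row by its total, giving P(Major | Gender).
cond = pd.crosstab(df["Gender"], df["Major"], normalize="index")

print(counts)
print(cond.round(2))

# Most likely major for each gender.
top_major = cond.idxmax(axis=1)
print(top_major)
```

Each row of `cond` sums to 1, so an entry can be read directly as the probability of that major among students of that gender.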
2.4. Assume that the sample is a representative of the population of CMSU. Based on the data, answer the following
question:
2.4.1. Find the probability that a randomly chosen student is a male and intends to graduate.
➢ The probability that a randomly chosen student is male and intends to graduate is 17/62 = 27.42%
2.4.2 Find the probability that a randomly selected student is a female and does NOT have a laptop.
➢ The probability that a randomly chosen student is female and does NOT have a laptop is 4/62 = 6.45%
2.5. Assume that the sample is representative of the population of CMSU. Based on the data, answer the
following question:
2.5.1. Find the probability that a randomly chosen student is a male or has full-time employment?
➢ The probability that a randomly chosen student is a male or has full-time employment is:
(29+10-7) /62 = 51.61%
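The calculation above is the addition rule, P(M ∪ FT) = P(M) + P(FT) - P(M ∩ FT): the 7 students who are both male and full-time employed are subtracted once so that they are not counted twice. A minimal sketch of the same arithmetic with exact fractions follows; the variable names are illustrative and the counts are read off the formula above.

```python
from fractions import Fraction

# Counts read off the formula (29 + 10 - 7) / 62.
n_students = 62
n_male = 29
n_full_time = 10
n_male_and_full_time = 7

# Addition rule: subtract the overlap so it is counted only once.
p_union = Fraction(n_male + n_full_time - n_male_and_full_time, n_students)

print(p_union, f"= {float(p_union):.2%}")
```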
2.5.2. Find the conditional probability that given a female student is randomly chosen, she is majoring in
international business or management.
➢ The conditional probability that given a female student is randomly chosen, she is majoring in
international business or management is: (4+4)/33 = 24.24%
2.6. Construct a contingency table of Gender and Intent to Graduate at 2 levels (Yes/No). The Undecided
students are not considered now and the table is a 2x2 table. Do you think the graduate intention and being
female are independent events?
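Number of female students who intend to graduate = 11, so P(GYes | F) = 11/33 ≈ 0.33 when all 33 female
students are counted.
➢ This is taken to imply that graduate intention and being female are independent events. Strictly,
independence requires P(F ∩ GYes) = P(F) × P(GYes) within the 2x2 table.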
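Two events are independent exactly when the joint probability equals the product of the marginals, P(F ∩ GYes) = P(F) × P(GYes). The sketch below checks that condition for any 2x2 table of counts. The function name `independence_gap` and both example tables are hypothetical and are not the survey counts.

```python
def independence_gap(n_f_yes, n_f_no, n_m_yes, n_m_no):
    """Return P(F and Yes) - P(F) * P(Yes) for a 2x2 table of counts.

    A gap of zero means the two events are independent in the table;
    the further from zero, the stronger the dependence.
    """
    total = n_f_yes + n_f_no + n_m_yes + n_m_no
    p_female = (n_f_yes + n_f_no) / total
    p_yes = (n_f_yes + n_m_yes) / total
    p_female_and_yes = n_f_yes / total
    return p_female_and_yes - p_female * p_yes


# Hypothetical table in which half of each gender answers Yes: independent.
gap_independent = independence_gap(10, 10, 15, 15)

# Hypothetical table in which the Yes share differs by gender: dependent.
gap_dependent = independence_gap(5, 15, 20, 10)

print(round(gap_independent, 4), round(gap_dependent, 4))
```

With sample data the gap is rarely exactly zero, so in practice a tolerance or a chi-square test of independence is used to judge whether the difference is meaningful.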
2.7. Note that there are four numerical (continuous) variables in the data set, GPA, Salary, Spending, and Text
Messages.
2.7.1. If a student is chosen randomly, what is the probability that his/her GPA is less than 3?
➢ The probability that a randomly chosen student has a GPA below 3 is 17/62 = 27.42%
2.7.2. Find the conditional probability that a randomly selected male earns 50 or more. Find the conditional
probability that a randomly selected female earns 50 or more
➢ The conditional probability that a randomly selected female earns 50 or more: 18/33 = 54.55%
➢ The conditional probability that a randomly selected male earns 50 or more: 14/29 = 48.28%
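Conditional probabilities on a numeric threshold can be computed by turning the condition into a True/False column and averaging it within each gender. This is a minimal sketch under the assumption that the columns are named `Gender` and `Salary` as in the info() output; the seven rows are invented for illustration.

```python
import pandas as pd

# Hypothetical rows for illustration; not the survey data.
df = pd.DataFrame({
    "Gender": ["Female", "Female", "Female", "Male", "Male", "Male", "Male"],
    "Salary": [55.0, 40.0, 60.0, 50.0, 45.0, 42.5, 65.0],
})

# True where the student earns 50 or more.
earns_50_plus = df["Salary"] >= 50

# The mean of a boolean column is the share of True values,
# so the group-wise mean is P(Salary >= 50 | Gender).
p_by_gender = earns_50_plus.groupby(df["Gender"]).mean()

print(p_by_gender.round(4))
```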
2.8. Note that there are four numerical (continuous) variables in the data set, GPA, Salary, Spending, and Text
Messages. For each of them comment whether they follow a normal distribution. Write a note summarizing
your conclusions for this whole Problem 2.
Problem 3
An important quality characteristic used by the manufacturers of ABC asphalt shingles is the amount of
moisture the shingles contain when they are packaged. Customers may feel that they have purchased a product
lacking in quality if they find moisture and wet shingles inside the packaging. In some cases, excessive moisture
can cause the granules attached to the shingles for texture and colouring purposes to fall off the shingles
resulting in appearance problems. To monitor the amount of moisture present, the company conducts moisture
tests. A shingle is weighed and then dried. The shingle is then reweighed, and based on the amount of moisture
taken out of the product, the pounds of moisture per 100 square feet is calculated. The company would like to
show that the mean moisture content is less than 0.35 pound per 100 square feet.
The file (A & B shingles.csv) includes 36 measurements (in pounds per 100 square feet) for A shingles and 31 for
B shingles.
➢ Observations & Assumptions:
• Sample size n is greater than 30 for both samples (36 for A and 31 for B)
• The level of significance alpha is taken as 0.05
• The population standard deviation is unknown, hence we use a t-test
• Null hypothesis H0: mean moisture content >= 0.35; alternative hypothesis H1: mean moisture content < 0.35
One-sample t-test for shingles B: t stat = -3.1003, p_value = 0.00418095
For Shingles A: p_value > Level of Significance, i.e. 0.14955266 > 0.05 (Cannot Reject Null Hypothesis)
For Shingles B: p_value < Level of Significance, i.e. 0.00418095 < 0.05 (Can Reject Null Hypothesis)
Conclusion:
For Shingles A, there is not enough evidence to support the claim that the population mean moisture
content of shingles A is less than 0.35 pound per 100 square feet, at the 0.05 significance level.
For Shingles B, there is enough evidence to support the claim that the population mean moisture
content of shingles B is less than 0.35 pound per 100 square feet, at the 0.05 significance level.
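A left-tailed one-sample t-test of this kind can be sketched with SciPy as follows. The readings are hypothetical and are not the A & B shingles data, and the `alternative` argument assumes SciPy 1.6 or later. Note that `ttest_1samp` returns a two-sided p-value by default; for a "less than" claim the one-sided value is half of it when the t statistic is negative.

```python
from scipy import stats

# Hypothetical moisture readings (pounds per 100 square feet), for illustration only.
sample = [0.30, 0.32, 0.28, 0.36, 0.31, 0.29, 0.33, 0.34]
mu0 = 0.35      # hypothesised mean moisture content
alpha = 0.05    # level of significance

# Default call: two-sided p-value (H1: mean != mu0).
t_two, p_two = stats.ttest_1samp(sample, popmean=mu0)

# Left-tailed call: H0: mean >= mu0 versus H1: mean < mu0.
t_less, p_less = stats.ttest_1samp(sample, popmean=mu0, alternative="less")

reject_null = bool(p_less < alpha)
print(f"t = {t_less:.4f}, two-sided p = {p_two:.4f}, left-tailed p = {p_less:.4f}")
print("Reject H0" if reject_null else "Cannot reject H0")
```

Whether the p-values quoted above are one-sided or two-sided is not stated; either reading leads to the same decisions for shingles A and B at the 0.05 level.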
3.2 Do you think that the population mean for shingles A and B are equal? Form the hypothesis and
conduct the test of the hypothesis. What assumption do you need to check before the test for equality
of means is performed?
Two-sample t-test for shingles A & B: t stat = 1.2896, p_value = 0.2017
p_value > Level of Significance, i.e. 0.2017 > 0.05 (Cannot Reject Null Hypothesis)
Conclusion:
The results indicate that we cannot reject the null hypothesis.
Hence, there is not enough evidence to conclude that the population means for shingles A and B
differ, at the 0.05 significance level.
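Question 3.2 also asks which assumption to check before testing equality of means. For the pooled two-sample t-test the usual one is equal population variances, alongside independent and roughly normal samples. The sketch below checks it with Levene's test and then runs the t-test accordingly. The helper name `compare_means` and both samples are hypothetical, and using Levene's test is an assumption rather than the method used in the report.

```python
from scipy import stats


def compare_means(a, b, alpha=0.05):
    """Two-sample t-test for H0: mean(a) == mean(b), two-sided.

    Levene's test decides between the pooled t-test (equal variances)
    and Welch's t-test (unequal variances).
    """
    _, levene_p = stats.levene(a, b)
    equal_var = bool(levene_p > alpha)
    t_stat, p_value = stats.ttest_ind(a, b, equal_var=equal_var)
    return {
        "levene_p": float(levene_p),
        "equal_var": equal_var,
        "t_stat": float(t_stat),
        "p_value": float(p_value),
        "reject_null": bool(p_value < alpha),
    }


# Hypothetical moisture readings, for illustration only.
sample_a = [0.30, 0.32, 0.28, 0.36, 0.31, 0.29, 0.33, 0.34]
sample_b = [0.27, 0.31, 0.26, 0.33, 0.29, 0.28, 0.30, 0.32]

result = compare_means(sample_a, sample_b)
print(result)
```

Failing to reject the null hypothesis only means the data are compatible with equal means; it does not prove that the means are equal.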