SMDM Project Report Dipti

The document analyzes annual spending data for 440 large retailers in Portugal across different regions and sales channels. [1] It finds that the "Hotel" channel and "Other" region spent the most overall, while the "Retail" channel and "Oporto" region spent the least. [2] An analysis of spending on six product varieties finds patterns such as "Fresh" having higher demand in "Hotels" while "Grocery" was higher in "Retail". Overall, the data was right-skewed for both regions and channels across all product varieties.

DSBA

DIPTI PATIL
Problem 1
A wholesale distributor operating in different regions of Portugal has information on the annual spending on several
items in its stores across different regions and channels. The data consists of 440 large retailers’ annual
spending on 6 different varieties of products in 3 different regions (Lisbon, Oporto, Other) and across 2
sales channels (Hotel, Retail).

Exploratory Data Analysis:

   Buyer/Spender  Channel  Region  Fresh  Milk  Grocery  Frozen  Detergents_Paper  Delicatessen
0              1   Retail   Other  12669  9656     7561     214              2674          1338
1              2   Retail   Other   7057  9810     9568    1762              3293          1776
2              3   Retail   Other   6353  8808     7684    2405              3516          7844
3              4    Hotel   Other  13265  1196     4221    6404               507          1788
4              5   Retail   Other  22615  5410     7198    3915              1777          5185

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 440 entries, 0 to 439
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Buyer/Spender 440 non-null int64
1 Channel 440 non-null object
2 Region 440 non-null object
3 Fresh 440 non-null int64
4 Milk 440 non-null int64
5 Grocery 440 non-null int64
6 Frozen 440 non-null int64
7 DetergentsPaper 440 non-null int64
8 Delicatessen 440 non-null int64
dtypes: int64(7), object(2)
memory usage: 31.1+ KB

1.1 Use methods of descriptive statistics to summarize data. Which Region and which Channel spent the most?
Which Region and which Channel spent the least?

        Buyer/Spender  Channel  Region      Fresh      Milk   Grocery    Frozen  DetergentsPaper  Delicatessen
count          440.00      440     440     440.00    440.00    440.00    440.00           440.00        440.00
unique            NaN        2       3        NaN       NaN       NaN       NaN              NaN           NaN
top               NaN    Hotel   Other        NaN       NaN       NaN       NaN              NaN           NaN
freq              NaN      298     316        NaN       NaN       NaN       NaN              NaN           NaN
mean           220.50      NaN     NaN   12000.30   5796.27   7951.28   3071.93          2881.49       1524.87
std            127.16      NaN     NaN   12647.33   7380.38   9503.16   4854.67          4767.85       2820.11
min              1.00      NaN     NaN       3.00     55.00      3.00     25.00             3.00          3.00
25%            110.75      NaN     NaN    3127.75   1533.00   2153.00    742.25           256.75        408.25
50%            220.50      NaN     NaN    8504.00   3627.00   4755.50   1526.00           816.50        965.50
75%            330.25      NaN     NaN   16933.75   7190.25  10655.75   3554.25          3922.00       1820.25
max            440.00      NaN     NaN  112151.00  73498.00  92780.00  60869.00         40827.00      47943.00

➢ The dataset has 440 observations and 9 variables.

• 2 are categorical variables (Channel, Region) and 7 are numerical variables (Buyer/Spender is an identifier; the other 6 are spending amounts).

• No variable or column has a null or missing value.

• The describe output shows that the mean is greater than the median for all six spending variables.
Also, Median - Q1 < Q3 - Median for each of them. This means that the data is right-skewed.

• There is a large gap between the 75th percentile and the maximum for every spending variable. This suggests
that the variables contain outliers. We will further verify the outliers by using a boxplot.
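The skewness reasoning above can be reproduced directly from the summary table. The following is a minimal sketch, assuming only the mean and quartile values printed by describe; the helper name `looks_right_skewed` is illustrative and not part of the original analysis.

```python
# (mean, Q1, median, Q3) copied from the describe table above.
summary = {
    "Fresh": (12000.30, 3127.75, 8504.00, 16933.75),
    "Milk": (5796.27, 1533.00, 3627.00, 7190.25),
    "Grocery": (7951.28, 2153.00, 4755.50, 10655.75),
    "Frozen": (3071.93, 742.25, 1526.00, 3554.25),
    "DetergentsPaper": (2881.49, 256.75, 816.50, 3922.00),
    "Delicatessen": (1524.87, 408.25, 965.50, 1820.25),
}


def looks_right_skewed(mean, q1, median, q3):
    """Hypothetical helper: both rules of thumb used in the text must hold."""
    return mean > median and (median - q1) < (q3 - median)


skew_flags = {name: looks_right_skewed(*values) for name, values in summary.items()}

for name, flag in skew_flags.items():
    print(f"{name}: right-skewed by both rules = {flag}")
```

Both rules are heuristics on summary statistics; a histogram or a sample skewness coefficient on the raw data would be the stronger check.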

➢ Total spending on each item, channel-wise:

Channel  Buyer/Spender    Fresh     Milk  Grocery   Frozen  DetergentsPaper  Delicatessen    Total
Hotel            71034  4015717  1028614  1180717  1116979           235587        421955  7999569
Retail           25986  1264414  1521743  2317845   234671          1032270        248988  6619931

• Channel-wise, ‘Hotel’ spends the most annually, with the highest spending on ‘Fresh’ and the least spending on
‘Detergents & Paper’.

• Channel-wise, ‘Retail’ spends the least annually, with the highest spending on ‘Grocery’ and the least spending
on ‘Frozen’.

➢ Total spending on each item, region-wise:

Region  Buyer/Spender    Fresh     Milk  Grocery  Frozen  DetergentsPaper  Delicatessen     Total
Lisbon          18095   854833   422454   570037  231026           204136        104327   2386813
Oporto          14899   464721   239144   433274  190132           173311         54506   1555088
Other           64026  3960577  1888759  2495251  930492           890410        512110  10677599

• Region-wise, ‘Other’ spends the most annually, with the highest spending on ‘Fresh’ and the least spending on
‘Delicatessen’.

• Region-wise, ‘Oporto’ spends the least annually, with the highest spending on ‘Fresh’ and the least spending
on ‘Delicatessen’.
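
The rankings above follow from summing the six item columns per group. Here is a small sketch that recomputes the totals from the figures in the two tables; the dictionary layout and the helper `rank_groups` are assumptions made for illustration (the original analysis presumably used a pandas group-by on the raw data, which is not shown).

```python
ITEMS = ["Fresh", "Milk", "Grocery", "Frozen", "DetergentsPaper", "Delicatessen"]

# Per-item totals copied from the channel-wise and region-wise tables above.
channel_spend = {
    "Hotel": [4015717, 1028614, 1180717, 1116979, 235587, 421955],
    "Retail": [1264414, 1521743, 2317845, 234671, 1032270, 248988],
}
region_spend = {
    "Lisbon": [854833, 422454, 570037, 231026, 204136, 104327],
    "Oporto": [464721, 239144, 433274, 190132, 173311, 54506],
    "Other": [3960577, 1888759, 2495251, 930492, 890410, 512110],
}


def rank_groups(spend):
    """Hypothetical helper: return (totals, biggest spender, smallest spender)."""
    totals = {group: sum(values) for group, values in spend.items()}
    return totals, max(totals, key=totals.get), min(totals, key=totals.get)


channel_totals, top_channel, low_channel = rank_groups(channel_spend)
region_totals, top_region, low_region = rank_groups(region_spend)

print(channel_totals, top_channel, low_channel)
print(region_totals, top_region, low_region)
```

Note that the Buyer/Spender column is an identifier, so it is left out of the totals.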
1.2 There are 6 different varieties of items that are considered. Describe and comment/explain all the varieties
across Region and Channel? Provide a detailed justification for your answer.

For Item: Fresh

• Hotel has higher average demand than Retail channel-wise, across all the regions.
• The Other region has exceptionally high demand region-wise.
• Overall, according to the boxplot, the data is right-skewed both region-wise and channel-wise.

For Item: Milk

• Retail has much higher average demand than Hotel channel-wise, across all the regions.
• The Other region has exceptionally high demand region-wise.
• Overall, according to the boxplot, the data is right-skewed both region-wise and channel-wise.

For Item: Grocery

• Retail has much higher average demand than Hotel channel-wise, across all the regions.
• The Oporto region has the least demand region-wise.
• Overall, according to the boxplot, the data is right-skewed both region-wise and channel-wise.

For Item: Frozen

• Hotel has higher average demand than Retail channel-wise, across all the regions.
• The Other region has the highest demand and Oporto has consistent demand region-wise.
• Overall, according to the boxplot, the data is right-skewed both region-wise and channel-wise.

For Item: Detergents & Paper

• Retail has much higher average demand than Hotel channel-wise, across all the regions.
• The Lisbon and Oporto regions have almost equal average demand region-wise.
• Overall, according to the boxplot, the data is right-skewed both region-wise and channel-wise.

For Item: Delicatessen

• Retail has higher average demand than Hotel channel-wise, across all the regions.
• The Lisbon and Other regions have almost equal average demand region-wise.
• Overall, according to the boxplot, the data is right-skewed both region-wise and channel-wise.

1.3 On the basis of a descriptive measure of variability, which item shows the most inconsistent behaviour?
Which items show the least inconsistent behaviour?

➢ Coefficient of variation (standard deviation / mean) for each item:

Coefficient of variation for ‘Fresh’ is 1.0539179237473149 (lowest)
Coefficient of variation for ‘Milk’ is 1.2732985840065414
Coefficient of variation for ‘Grocery’ is 1.1951743730016824
Coefficient of variation for ‘Frozen’ is 1.5803323836352914
Coefficient of variation for ‘Detergents & Paper’ is 1.6546471385005155
Coefficient of variation for ‘Delicatessen’ is 1.8494068981158382 (highest)

➢ ‘Delicatessen’ shows the most inconsistent behaviour, as it has the highest coefficient of variation.

➢ ‘Fresh’ shows the least inconsistent behaviour, as it has the lowest coefficient of variation.
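
The coefficients above can be approximated from the rounded mean and standard deviation in the describe table. This is a minimal sketch under that assumption; because the inputs are rounded to two decimals, the results match the reported values only to about four decimal places.

```python
# (mean, std) copied from the describe table in section 1.1.
mean_std = {
    "Fresh": (12000.30, 12647.33),
    "Milk": (5796.27, 7380.38),
    "Grocery": (7951.28, 9503.16),
    "Frozen": (3071.93, 4854.67),
    "DetergentsPaper": (2881.49, 4767.85),
    "Delicatessen": (1524.87, 2820.11),
}

# Coefficient of variation = standard deviation / mean (unitless).
cv = {item: std / mean for item, (mean, std) in mean_std.items()}

most_inconsistent = max(cv, key=cv.get)
least_inconsistent = min(cv, key=cv.get)

for item, value in sorted(cv.items(), key=lambda pair: pair[1]):
    print(f"{item}: CV = {value:.4f}")
print("most inconsistent:", most_inconsistent)
print("least inconsistent:", least_inconsistent)
```

The coefficient of variation is used here rather than the raw standard deviation because the six items have very different average spending levels.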
1.4 Are there any outliers in the data? Back up your answer with a suitable plot/technique with the help of
detailed comments.

➢ As the boxplot shows, all six varieties have outliers. ‘Fresh’ and ‘Grocery’ have the highest-valued
outliers.
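
A boxplot flags points beyond 1.5 × IQR from the quartiles, so the presence of high outliers can be cross-checked from the summary table alone. The sketch below assumes the conventional 1.5 × IQR whisker rule and uses only the Q1, Q3 and max values printed in section 1.1; it can confirm that at least one high outlier exists per item, not how many there are.

```python
# (Q1, Q3, max) copied from the describe table in section 1.1.
quartiles = {
    "Fresh": (3127.75, 16933.75, 112151.00),
    "Milk": (1533.00, 7190.25, 73498.00),
    "Grocery": (2153.00, 10655.75, 92780.00),
    "Frozen": (742.25, 3554.25, 60869.00),
    "DetergentsPaper": (256.75, 3922.00, 40827.00),
    "Delicatessen": (408.25, 1820.25, 47943.00),
}


def upper_fence(q1, q3, k=1.5):
    """Upper boxplot whisker limit: Q3 + k * IQR."""
    return q3 + k * (q3 - q1)


has_high_outlier = {
    item: maximum > upper_fence(q1, q3)
    for item, (q1, q3, maximum) in quartiles.items()
}

for item, (q1, q3, maximum) in quartiles.items():
    print(f"{item}: fence = {upper_fence(q1, q3):.2f}, max = {maximum:.2f}")
```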

1.5 On the basis of your analysis, what are your recommendations for the business? How can your analysis help
the business to solve its problem? Answer from the business perspective.

➢ ‘Fresh’ has the highest overall demand, both channel-wise and region-wise.

➢ ‘Fresh’ and ‘Frozen’ have high demand in the Hotel channel.

➢ ‘Milk’, ‘Detergents & Paper’ and ‘Grocery’ have higher demand in the Retail channel than in Hotel.

➢ The Other region has higher demand for all varieties than Lisbon and Oporto.

➢ To maximize profits, the wholesaler must keep a consistent supply and stock of each variety according
to the above points.
Problem 2:

The Student News Service at Clear Mountain State University (CMSU) has decided to gather data about the
undergraduate students that attend CMSU. CMSU creates and distributes a survey of 14 questions and receives
responses from 62 undergraduates.

Exploratory Data Analysis:

   ID  Gender  Age  Class   Major       Grad Intention  GPA  Employment  Salary  Social Networking  Satisfaction  Spending  Computer  Text Messages
0   1  Female   20  Junior  Other       Yes             2.9  Full-Time     50.0                  1             3       350  Laptop              200
1   2  Male     23  Senior  Management  Yes             3.6  Part-Time     25.0                  1             4       360  Laptop               50
2   3  Male     21  Junior  Other       Yes             2.5  Part-Time     45.0                  2             4       600  Laptop              200
3   4  Male     21  Junior  CIS         Yes             2.5  Full-Time     40.0                  4             6       600  Laptop              250
4   5  Male     23  Senior  Other       Undecided       2.8  Unemployed    40.0                  2             4       500  Laptop              100

Descriptive Analysis:

          ID    Age    GPA  Salary  SocialNetworking  Satisfaction  Spending  TextMessages
count  62.00  62.00  62.00   62.00             62.00         62.00     62.00         62.00
mean   31.50  21.13   3.13   48.55              1.52          3.74    482.02        246.21
std    18.04   1.43   0.38   12.08              0.84          1.21    221.95        214.47
min     1.00  18.00   2.30   25.00              0.00          1.00    100.00          0.00
25%    16.25  20.00   2.90   40.00              1.00          3.00    312.50        100.00
50%    31.50  21.00   3.15   50.00              1.00          4.00    500.00        200.00
75%    46.75  22.00   3.40   55.00              2.00          4.00    600.00        300.00
max    62.00  26.00   3.90   80.00              4.00          6.00   1400.00        900.00

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62 entries, 0 to 61
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 62 non-null int64
1 Gender 62 non-null object
2 Age 62 non-null int64
3 Class 62 non-null object
4 Major 62 non-null object
5 GradIntention 62 non-null object
6 GPA 62 non-null float64
7 Employment 62 non-null object
8 Salary 62 non-null float64
9 SocialNetworking 62 non-null int64
10 Satisfaction 62 non-null int64
11 Spending 62 non-null int64
12 Computer 62 non-null object
13 TextMessages 62 non-null int64
dtypes: float64(2), int64(6), object(6)
memory usage: 6.9+ KB

➢ The dataset has 62 observations and 14 variables.

• 6 are categorical variables and 8 are numerical variables (including the ID column).

• No variable or column has a null or missing value.

• The descriptive analysis shows that the average age is 21.13, the average GPA is 3.13 and the average salary is 48.55.

• Also, the mean is close to the median for most numerical variables, so those distributions are roughly
symmetrical about the median. Text Messages is the main exception (mean 246.21 vs. median 200.00).

2.1. For this data, construct the following contingency tables (Keep Gender as row variable)
2.1.1. Gender and Major

2.1.2. Gender and Grad Intention

2.1.3. Gender and Employment

2.1.4. Gender and Computer
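
The contingency tables themselves did not survive in this copy. As a rough sketch of how such a table is built, the snippet below cross-tabulates Gender against Major using only the five preview rows shown in the Exploratory Data Analysis above, so the counts are not the full-survey counts. The helper `crosstab` is a hypothetical stand-in for a library cross-tabulation routine.

```python
from collections import Counter

# (Gender, Major) for the five preview rows only; NOT the full 62-row survey.
preview_rows = [
    ("Female", "Other"),
    ("Male", "Management"),
    ("Male", "Other"),
    ("Male", "CIS"),
    ("Male", "Other"),
]


def crosstab(pairs):
    """Hypothetical helper: nested dict of counts with the first field as rows."""
    counts = Counter(pairs)
    row_labels = sorted({row for row, _ in pairs})
    col_labels = sorted({col for _, col in pairs})
    return {r: {c: counts[(r, c)] for c in col_labels} for r in row_labels}


table = crosstab(preview_rows)

for gender, row in table.items():
    print(gender, row)
```

The same helper would apply to the other three tables by swapping the second field for Grad Intention, Employment or Computer.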


2.2. Assume that the sample is representative of the population of CMSU. Based on the data, answer the
following question:

2.2.1. What is the probability that a randomly selected CMSU student will be male?

➢ The data has 62 entries, of which 29 are male students and 33 are female students.
➢ The probability that a randomly selected CMSU student is male is 29/62 = 46.77%
2.2.2. What is the probability that a randomly selected CMSU student will be female?
➢ The probability that a randomly selected CMSU student is female is 33/62 = 53.23%

2.3. Assume that the sample is representative of the population of CMSU. Based on the data, answer the
following question:

2.3.1. Find the conditional probability of different majors among the male students in CMSU.

➢ According to the data, 3 of the 29 male students are undecided about their major.

➢ Probability of Accounting major, given male, is 13.79%

➢ Probability of CIS major, given male, is 3.45%

➢ Probability of Economics/Finance major, given male, is 13.79%

➢ Probability of International Business major, given male, is 6.90%

➢ Probability of Management major, given male, is 20.69%

➢ Probability of Other major, given male, is 13.79%

➢ Probability of Retailing/Marketing major, given male, is 17.24%

➢ Probability of Undecided, given male, is 10.34%

2.3.2. Find the conditional probability of different majors among the Female students in CMSU.

➢ According to the data, all female students have decided their major.

➢ Probability of Accounting major, given female, is 9.09%

➢ Probability of CIS major, given female, is 9.09%

➢ Probability of Economics/Finance major, given female, is 21.21%

➢ Probability of International Business major, given female, is 12.12%

➢ Probability of Management major, given female, is 12.12%

➢ Probability of Other major, given female, is 9.09%

➢ Probability of Retailing/Marketing major, given female, is 27.27%

According to the above data, the most common major among female students is ‘Retailing/Marketing’ and the
most common major among male students is ‘Management’.
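
The conditional probabilities in 2.3.1 and 2.3.2 are each a major count divided by the gender total. The sketch below is an illustration, not the original computation: the per-major counts are an assumption, back-calculated from the percentages above and the group sizes of 29 and 33 (only the 3 undecided males and the 4 + 4 females in International Business and Management are stated explicitly in the text).

```python
# Counts back-calculated from the reported percentages (assumption).
male_counts = {
    "Accounting": 4, "CIS": 1, "Economics/Finance": 4,
    "International Business": 2, "Management": 6, "Other": 4,
    "Retailing/Marketing": 5, "Undecided": 3,
}
female_counts = {
    "Accounting": 3, "CIS": 3, "Economics/Finance": 7,
    "International Business": 4, "Management": 4, "Other": 3,
    "Retailing/Marketing": 9, "Undecided": 0,
}


def conditional_pct(counts):
    """P(major | gender) in percent, rounded to two decimals."""
    total = sum(counts.values())
    return {major: round(100 * n / total, 2) for major, n in counts.items()}


male_pct = conditional_pct(male_counts)
female_pct = conditional_pct(female_counts)

print(male_pct)
print(female_pct)
print("top male major:", max(male_counts, key=male_counts.get))
print("top female major:", max(female_counts, key=female_counts.get))
```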

2.4. Assume that the sample is a representative of the population of CMSU. Based on the data, answer the following
question:

2.4.1. Find the probability That a randomly chosen student is a male and intends to graduate.

➢ The probability that a randomly chosen student is male and intends to graduate is 17/62 = 27.42%

2.4.2 Find the probability that a randomly selected student is a female and does NOT have a laptop.

➢ The probability that a randomly chosen student is female and does NOT have a laptop is 4/62 = 6.45%

2.5. Assume that the sample is representative of the population of CMSU. Based on the data, answer the
following question:

2.5.1. Find the probability that a randomly chosen student is a male or has full-time employment?

Number of male students = 29

Number of students with full-time employment = 10

Number of male students with full-time employment = 7

➢ The probability that a randomly chosen student is male or has full-time employment is:
(29+10-7)/62 = 51.61%

2.5.2. Find the conditional probability that given a female student is randomly chosen, she is majoring in
international business or management.

Number of female students in the International Business major = 4

Number of female students in the Management major = 4

➢ The conditional probability that a randomly chosen female student is majoring in
international business or management is: (4+4)/33 = 24.24%
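
The two results in 2.5 are applications of the addition rule, P(A or B) = P(A) + P(B) - P(A and B), and of conditioning on the female subgroup. A minimal sketch using exact fractions and the counts quoted above (variable names are illustrative):

```python
from fractions import Fraction

n_students = 62
n_male = 29
n_full_time = 10
n_male_full_time = 7

# Addition rule: subtract the overlap so male full-time students are counted once.
p_male_or_full_time = Fraction(n_male + n_full_time - n_male_full_time, n_students)

n_female = 33
n_female_intl_business = 4
n_female_management = 4

# The two majors are mutually exclusive, so their counts simply add.
p_ib_or_mgmt_given_female = Fraction(n_female_intl_business + n_female_management, n_female)

print(f"{float(p_male_or_full_time):.2%}")        # 51.61%
print(f"{float(p_ib_or_mgmt_given_female):.2%}")  # 24.24%
```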

2.6. Construct a contingency table of Gender and Intent to Graduate at 2 levels (Yes/No). The Undecided
students are not considered now and the table is a 2x2 table. Do you think the graduate intention and being
female are independent events?

Total number of female students = 33, P(F) = 33/62 = 0.53

Total number of students who intend to graduate = 28, P(GYes) = 28/62 = 0.45

Number of female students who intend to graduate = 11, so P(F ∩ GYes) = 11/62 = 0.18
and P(GYes | F) = 11/33 = 0.33

For independence we need P(F ∩ GYes) = P(F)P(GYes), or equivalently P(GYes | F) = P(GYes).

Here P(F)P(GYes) = 0.53*0.45 = 0.24, which is not equal to P(F ∩ GYes) = 0.18. Likewise, P(GYes | F) = 0.33
is not equal to P(GYes) = 0.45.

➢ This implies that Graduate Intention and being Female are not independent events.
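
A short sketch of the independence check using exact fractions, with the counts quoted above and all 62 respondents as the base, as in the text. Note that 11/62 is the joint probability P(F ∩ GYes), while 11/33 is the conditional probability P(GYes | F). Exact equality is a descriptive check only; a chi-square test of independence would be the usual formal procedure and is not attempted here.

```python
from fractions import Fraction

n_total = 62
n_female = 33
n_grad_yes = 28
n_female_grad_yes = 11

p_f = Fraction(n_female, n_total)
p_g = Fraction(n_grad_yes, n_total)
p_joint = Fraction(n_female_grad_yes, n_total)  # P(F and GYes)
p_g_given_f = p_joint / p_f                     # P(GYes | F) = 11/33

# Two events are independent exactly when the joint equals the product.
independent = p_joint == p_f * p_g

print(f"P(F)P(GYes) = {float(p_f * p_g):.4f}")
print(f"P(F and GYes) = {float(p_joint):.4f}")
print(f"P(GYes | F) = {float(p_g_given_f):.4f}, P(GYes) = {float(p_g):.4f}")
print("independent:", independent)
```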

2.7. Note that there are four numerical (continuous) variables in the data set, GPA, Salary, Spending, and Text
Messages.

Answer the following questions based on the data

2.7.1. If a student is chosen randomly, what is the probability that his/her GPA is less than 3?

Students with GPA < 3: 17

➢ The probability that a randomly chosen student has a GPA below 3 is 17/62 = 27.42%

2.7.2. Find the conditional probability that a randomly selected male earns 50 or more. Find the conditional
probability that a randomly selected female earns 50 or more

Female students earning 50 or more = 18

Male students earning 50 or more = 14

➢ The conditional probability that a randomly selected female earns 50 or more: 18/33 = 54.55%
➢ The conditional probability that a randomly selected male earns 50 or more: 14/29 = 48.28%
2.8. Note that there are four numerical (continuous) variables in the data set, GPA, Salary, Spending, and Text
Messages. For each of them comment whether they follow a normal distribution. Write a note summarizing
your conclusions for this whole Problem 2.

From the boxplots, we can gather that:

• The distributions of Salary and Spending are left-skewed and have outliers.
• The distribution of Text Messages is right-skewed and has outliers.
• GPA approximately follows a normal distribution and does not have any outliers.

Problem 3

An important quality characteristic used by the manufacturers of ABC asphalt shingles is the amount of
moisture the shingles contain when they are packaged. Customers may feel that they have purchased a product
lacking in quality if they find moisture and wet shingles inside the packaging. In some cases, excessive moisture
can cause the granules attached to the shingles for texture and colouring purposes to fall off the shingles
resulting in appearance problems. To monitor the amount of moisture present, the company conducts moisture
tests. A shingle is weighed and then dried. The shingle is then reweighed, and based on the amount of moisture
taken out of the product, the pounds of moisture per 100 square feet is calculated. The company would like to
show that the mean moisture content is less than 0.35 pound per 100 square feet.

The file (A & B shingles.csv) includes 36 measurements (in pounds per 100 square feet) for A shingles and 31 for
B shingles.
➢ Observations & Assumptions:
• The sample size n is greater than 30 for both samples A and B.
• The level of significance (alpha) is taken as 0.05.
• The population standard deviation is unknown, hence we use a t-test.

• Step 1: Define the hypotheses

The company wants to show that the mean moisture content is less than 0.35, so that claim is the alternative:

Null hypothesis, Ho: mean moisture content >= 0.35
Alternative hypothesis, Ha: mean moisture content < 0.35

• Step 2: Set alpha at the 0.05 significance level

• Step 3: Compute the test statistic, t

Calculate the sample mean moisture content for shingles A:

Degrees of freedom = n-1 = 36-1 = 35

Calculate the sample mean moisture content for shingles B:

Degrees of freedom = n-1 = 31-1 = 30

• Step 4: Calculate the p-value and test statistic

One-sample t-test for shingles A: t stat = -1.4735046253382782 & two-sided p_value = 0.14955266

One-sample t-test for shingles B: t stat = -3.1003313069986995 & two-sided p_value = 0.00418095

The reported p-values are two-sided. Because Ha is left-tailed and both t statistics are negative, the
one-sided p-value is half of the two-sided value: 0.07477633 for A and 0.002090475 for B.

• Step 5: Decide whether to reject or fail to reject the null hypothesis

For Shingles A: p_value > level of significance, i.e. 0.07477633 > 0.05 (cannot reject the null hypothesis)
For Shingles B: p_value < level of significance, i.e. 0.002090475 < 0.05 (reject the null hypothesis)

Conclusion:
For Shingles A, there is not enough evidence to support the claim that the population mean moisture
content of shingles A is less than 0.35 pounds per 100 square feet, at the 0.05 significance level.

For Shingles B, there is enough evidence to support the claim that the population mean moisture
content of shingles B is less than 0.35 pounds per 100 square feet, at the 0.05 significance level.
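
The claim being tested is one-sided (mean moisture below 0.35), while the reported p-values have the size expected of two-sided output for these t statistics and degrees of freedom. The sketch below shows the standard conversion to a left-tailed p-value; treating the reported values as two-sided is an assumption, and `left_tailed_p` is an illustrative helper, not a library function.

```python
ALPHA = 0.05


def left_tailed_p(t_stat, p_two_sided):
    """Convert a two-sided t-test p-value to a left-tailed one (Ha: mean < mu0)."""
    half = p_two_sided / 2
    return half if t_stat < 0 else 1 - half


# t statistics and p-values as reported in Step 4 above.
reported = {
    "A": (-1.4735046253382782, 0.14955266),
    "B": (-3.1003313069986995, 0.00418095),
}

one_sided = {name: left_tailed_p(t, p) for name, (t, p) in reported.items()}
reject_null = {name: p < ALPHA for name, p in one_sided.items()}

for name in reported:
    print(f"Shingles {name}: one-sided p = {one_sided[name]:.8f}, reject Ho = {reject_null[name]}")
```

The decisions match the conclusions above: the null hypothesis is rejected for B but not for A.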

3.2 Do you think that the population mean for shingles A and B are equal? Form the hypothesis and
conduct the test of the hypothesis. What assumption do you need to check before the test for equality
of means is performed?

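• Step 1: Define the hypotheses

Null hypothesis, Ho: µA = µB
Alternative hypothesis, Ha: µA ≠ µB

• Step 2: Set alpha at the 0.05 significance level

• Step 3: Compute the test statistic, t

• We have 2 independent samples for this test and the population standard deviation is not provided.
• The sample sizes of the two samples are different (36 and 31).
• This is a two-tailed test.
• Assumptions to check before the test: both samples should be approximately normally distributed, and for the
pooled (equal-variance) t-test the two population variances should be comparable.
• Degrees of freedom for the Shingles A sample = n-1 = 36-1 = 35
• Degrees of freedom for the Shingles B sample = n-1 = 31-1 = 30
• Assuming the pooled two-sample t-test, degrees of freedom = nA + nB - 2 = 36 + 31 - 2 = 65

• Step 4: Calculate the p-value and test statistic

Two-sample t-test for shingles A & B: t stat = 1.2896282719661123 & p_value = 0.2017496571835306

• Step 5: Decide whether to reject or fail to reject the null hypothesis

p_value > level of significance, i.e. 0.2017496571835306 > 0.05 (cannot reject the null hypothesis)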

Conclusion:
The results indicate that we cannot reject the null hypothesis.

Hence, there is not enough evidence to conclude that the population means for shingles A and B
differ, at the 0.05 significance level.
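
For reference, a pooled (equal-variance) two-sample t statistic can be computed as below. This is only a sketch of the formula: the shingle measurements are not reproduced in this report, so the two short samples are hypothetical toy values, and whether the original test used pooled or unequal variances is an assumption.

```python
import math
import statistics


def pooled_t(sample_a, sample_b):
    """Return (t statistic, degrees of freedom) for a pooled two-sample t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    df = n_a + n_b - 2
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / df
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / std_err, df


# Hypothetical toy samples, NOT the A & B shingles data.
toy_a = [0.30, 0.34, 0.32, 0.36]
toy_b = [0.27, 0.29, 0.31, 0.25]

t_stat, df = pooled_t(toy_a, toy_b)
print(f"t = {t_stat:.4f}, df = {df}")
```

With the real samples of 36 and 31 measurements, the same formula gives 65 degrees of freedom; the p-value then comes from the t distribution with that many degrees of freedom.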
