SMDM Project

The document provides an analysis of data from 440 large retailers on their annual spending across 6 product varieties in 3 Portuguese regions and 2 sales channels. Key findings from the analysis include: - The region that spent the most was "Other" and the region that spent the least was "Oporto". The channel that spent the most was "Hotel" and the channel that spent the least was "Retail". - Across regions and channels, spending was highest on "Fresh" products and lowest on "Delicatessen". - "Fresh products" showed the most inconsistent behavior across regions and channels, while "Delicatessen" showed the least inconsistent behavior. - All 6 product varieties showed

Uploaded by

crispin anthony

SMDM PROJECT

Business Report On Wholesale Customers Analysis


CONTEXT:
A wholesale distributor operating in different regions of Portugal has information on the
annual spending of several items in their stores across different regions and channels. The
data consists of 440 large retailers’ annual spending on 6 different varieties of products in 3
different regions (Lisbon, Oporto, Other) and across different sales channels (Hotel, Retail).

1.1.1 Use methods of descriptive statistics to summarize data.


Which Region and which Channel spent the most?
Which Region and which Channel spent the least?

Descriptive statistics is concerned with data summarization through graphs/charts and tables. The methods of
descriptive statistics include Distribution, which deals with each value's frequency, Measures of Central
Tendency and Measures of Variability. The most widely used measures of central tendency are the Arithmetic
Mean, Median, and Mode.
Mean is defined as the arithmetic average of all observations in the data set.
Median is defined as the middle value in the data set arranged in ascending or descending order.
Mode is defined as the most frequently occurring value in the distribution; it has the largest frequency.

Measures of Dispersion include Range, IQR and Standard Deviation.

Range is the simplest of all measures of dispersion. It is calculated as the difference between the maximum
and minimum values in the data set.
Inter-Quartile Range (IQR) is computed on the middle 50% of the observations after eliminating the
highest and lowest 25% of observations in a data set that is arranged in ascending order. IQR is less affected
by outliers.
Standard deviation, in simple words, is the square root of the variance.
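As a rough illustration of the measures defined above, the sketch below computes them with Python's standard `statistics` module. The `spend` values are made up for the example and are not taken from the wholesale dataset; quartiles depend on the interpolation method, and the module's default ("exclusive") method is assumed here.

```python
import statistics

# Illustrative annual spending values (made up, not from the dataset)
spend = [120, 150, 150, 180, 200, 240, 900]

# Measures of central tendency
mean = statistics.mean(spend)
median = statistics.median(spend)
mode = statistics.mode(spend)

# Measures of dispersion
value_range = max(spend) - min(spend)
q1, q2, q3 = statistics.quantiles(spend, n=4)  # quartiles, default "exclusive" method
iqr = q3 - q1
variance = statistics.variance(spend)  # sample variance (divides by n - 1)
std_dev = statistics.stdev(spend)      # square root of the sample variance

print(f"mean={mean:.2f} median={median} mode={mode}")
print(f"range={value_range} IQR={iqr} std={std_dev:.2f}")
```

The single large value (900) pulls the mean well above the median, while the IQR is unaffected by it, which is the sense in which the IQR is less sensitive to outliers.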

The table below shows the description of the Wholesale customer dataset:
In the table below we can see some sample records, which have 2
categorical variables and 7 numerical variables.

The data consists of 440 large retailers’ annual spending on 6 different varieties of
products in 3 different regions (Lisbon, Oporto, Other) and
across different sales channels (Hotel, Retail).
The Region that has spent the most is Other (10677599) and the
region that has spent the least is Oporto (1555088).
The Channel that has spent the most is Hotel (7999569) and the
channel that has spent the least is Retail (6619931).
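One way such region and channel totals could be obtained is with a pandas `groupby`. The sketch below is only an illustration: the column names (`Region`, `Channel`, `Fresh`, `Milk`, `Grocery`) are assumed, and the rows are toy values rather than the report's data.

```python
import pandas as pd

# Toy rows only; column names are assumed and values are made up
df = pd.DataFrame({
    "Region": ["Lisbon", "Oporto", "Other", "Other", "Lisbon", "Oporto"],
    "Channel": ["Hotel", "Retail", "Hotel", "Retail", "Retail", "Hotel"],
    "Fresh": [100, 40, 300, 120, 80, 60],
    "Milk": [50, 30, 90, 100, 70, 20],
    "Grocery": [60, 80, 110, 150, 90, 30],
})

# Total spending per retailer across the item columns
item_cols = ["Fresh", "Milk", "Grocery"]
df["Total"] = df[item_cols].sum(axis=1)

# Sum the totals within each region and within each channel
region_totals = df.groupby("Region")["Total"].sum()
channel_totals = df.groupby("Channel")["Total"].sum()

print("Region  most:", region_totals.idxmax(), " least:", region_totals.idxmin())
print("Channel most:", channel_totals.idxmax(), " least:", channel_totals.idxmin())
```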

1.2. There are 6 different varieties of items that are considered. Describe and
comment/explain all the varieties across Region and Channel? Provide a
detailed justification for your answer.

When we sum up the spending across each channel and region,
we get the total spending across each channel and region in
the following table. The spending on the 6 different varieties
of items (Fresh, Milk, Grocery, Frozen, Detergents paper and
Delicatessen) can be further summarized in the bar graph.
From the above graph, we can see that in Lisbon the most is
spent on Fresh products and the least on Delicatessen. In
Oporto, the most is spent on Fresh products and the least on
Delicatessen. In the Other region, the most is spent on Fresh
products and the least on Delicatessen.

The above graph clearly shows that in the Retail channel the
most is spent on Grocery products and the least on Frozen
food products. In the Hotel channel the most is spent on
Fresh products and the least on Detergents paper.
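The per-item comparison described above can be sketched by summing each item within a group and then asking which item is largest or smallest in each row. The frame below uses assumed column names and made-up numbers with only three of the six items, so it illustrates the technique rather than reproducing the report's figures.

```python
import pandas as pd

# Toy data; column names are assumed and values are made up
df = pd.DataFrame({
    "Region": ["Lisbon", "Lisbon", "Oporto", "Other"],
    "Channel": ["Hotel", "Retail", "Hotel", "Retail"],
    "Fresh": [500, 100, 300, 200],
    "Grocery": [100, 400, 80, 450],
    "Delicatessen": [30, 60, 20, 50],
})
items = ["Fresh", "Grocery", "Delicatessen"]

# Item totals within each region and within each channel
by_region = df.groupby("Region")[items].sum()
by_channel = df.groupby("Channel")[items].sum()

# For each group (row), the item with the highest and lowest total
top_by_channel = by_channel.idxmax(axis=1)
low_by_channel = by_channel.idxmin(axis=1)
top_by_region = by_region.idxmax(axis=1)

print(by_channel)
print(top_by_channel)
print(low_by_channel)
```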
1.3 On the basis of the descriptive measure of variability, which item shows
the most inconsistent behaviour? Which items shows the least inconsistent
behaviour?

The common descriptive measures of variability are the range,
IQR, variance, and standard deviation. To check the
inconsistent behaviour of an item we can calculate the
coefficient of variation of each variable. The following
pie chart shows how each item has performed across
the 3 different locations (Lisbon, Oporto and Other) for both
the Retail and Hotel channels.

This table shows that the coefficient of variation of Fresh
products is 105.25% while that of Delicatessen is 184.42%.
A higher coefficient of variation means more variability
relative to the mean, so on this relative measure Delicatessen
is the more inconsistent of the two. In absolute terms
(standard deviation), Fresh products show the most inconsistent
behaviour and Delicatessen shows the least inconsistent
behaviour.
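The coefficient of variation referred to above is the standard deviation divided by the mean, so a larger value means more spread relative to the typical level of spending. The sketch below shows the calculation on two made-up items; the item names and numbers are hypothetical, and the sample standard deviation (n - 1) is assumed.

```python
import statistics


def coefficient_of_variation(values):
    """Return the sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)


# Made-up spending figures for two hypothetical items
spending = {
    "steady_item": [95, 100, 105, 100, 100],
    "erratic_item": [10, 20, 150, 5, 15],
}

cv = {item: coefficient_of_variation(vals) for item, vals in spending.items()}

# The item with the largest CV varies the most relative to its own mean
most_inconsistent = max(cv, key=cv.get)
least_inconsistent = min(cv, key=cv.get)

for item, value in cv.items():
    print(f"{item}: {value:.2%}")
print("most:", most_inconsistent, "least:", least_inconsistent)
```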
This pair plot helps us to understand the relationship between
the 6 items.
1.4 Are there any outliers in the data? Back up your answer with a suitable
plot/technique with the help of detailed comments.

From the boxplot below, we can clearly see that all the six items have outliers.

Outliers are observations in a dataset that don’t fit in some way. Perhaps the most common
or familiar type of outlier is an observation that is far from
the rest of the observations or the centre of mass of
observations. Outliers can skew statistical measures and
data distributions, providing a misleading representation of
the underlying data and relationships. Removing outliers from
data prior to modelling can result in a better fit of the
data and, in turn, more skilful prediction.
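A box plot conventionally flags points lying more than 1.5 times the IQR beyond the quartiles. The sketch below applies that rule directly; the multiplier of 1.5 is the usual convention and is assumed here, the helper name `iqr_outliers` is hypothetical, and the sample values are made up.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR] and the two fences."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    flagged = [v for v in values if v < lower or v > upper]
    return flagged, (lower, upper)


# Made-up sample with one value far from the rest
sample = [10, 12, 13, 15, 16, 18, 20, 95]
outliers, (lower, upper) = iqr_outliers(sample)

print("fences:", lower, upper)
print("outliers:", outliers)
```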
1.5 On the basis of your analysis, what are your recommendations for the
business? How can your analysis help the business to solve its problem?
Answer from the business perspective.
From this analysis we can conclude that:
(1) In total, the business spends the most
on Fresh products across different channels and different
regions, so the company needs to ensure that it is driving the
most profit from this food item.
(2) Since Delicatessen shows the least inconsistent
behaviour, the business should invest more in this food item
because it is less risky.
(3) Fresh products require more spending; to cut its costs, the
wholesale distributor can concentrate more on other
items like Milk, Grocery, Frozen, Detergents paper and
Delicatessen.
BUSINESS REPORT ON EDUCATION – POST 12TH STANDARD

CONTEXT:
The dataset Education - Post 12th Standard.csv contains information on various colleges.
You are expected to do a Principal Component Analysis for this case study according to the
instructions given.

2.1 Perform Exploratory Data Analysis [Univariate, Bivariate, and
Multivariate analysis to be performed]. What insight do you draw from the
EDA?
We will explore the Data set and perform the exploratory data analysis on the dataset.
The major topics to be covered are below:

Removing duplicates

Missing value treatment

Outlier Treatment

Normalization and Scaling (Numerical Variables)

Encoding Categorical variables (Dummy Variables)

Univariate Analysis

Bivariate Analysis

As a first step, we will import all the necessary libraries that we expect to need to
perform the EDA.

In this step, we will perform the operations below to check what the dataset comprises.
We will check the following:

Head of the dataset

Shape of the dataset

Info of the dataset

Summary of the dataset
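The four checks listed above map onto standard pandas calls. The sketch below runs them on a tiny stand-in frame; the column names follow the parameter list in this report, but the college names and values are made up, and in practice the frame would be loaded from the CSV file instead.

```python
import pandas as pd

# Tiny stand-in frame with made-up values; a real run would load the CSV
df = pd.DataFrame({
    "Names": ["College A", "College B", "College C"],
    "Apps": [1000, 2000, 1500],
    "Accept": [800, 1500, 1200],
    "Grad.Rate": [60, 70, 65],
})

print(df.head())      # head of the dataset (first rows)
print(df.shape)       # shape: (number of rows, number of columns)
df.info()             # info: column names, non-null counts, dtypes
summary = df.describe()  # summary statistics for the numerical columns
print(summary)
```

By default `describe()` covers only the numerical columns, which is why the `Names` column does not appear in the summary.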


The table below shows the description of the dataset:

In the table below we can see some sample records, which have 1
categorical variable and 17 numerical variables.

The data consists of 777 entries, one per college or university, on 18 different parameters, such as:


Names: Names of various university and colleges
Apps: Number of applications received
Accept: Number of applications accepted
Enroll: Number of new students enrolled
Top10perc: Percentage of new students from top 10% of Higher Secondary class
Top25perc: Percentage of new students from top 25% of Higher Secondary class
F.Undergrad: Number of full-time undergraduate students
P.Undergrad: Number of part-time undergraduate students
Outstate: Number of students for whom the particular college or university is Out-of-state
tuition
Room.Board: Cost of Room and board
Books: Estimated book costs for a student
Personal: Estimated personal spending for a student
PhD: Percentage of faculties with Ph.D.’s
Terminal: Percentage of faculties with terminal degree
S.F.Ratio: Student/faculty ratio
perc.alumni: Percentage of alumni who donate
Expend: The Instructional expenditure per student
Grad.Rate: Graduation rate

The data below shows the first five rows of the dataset.


The data below shows the last five rows of the dataset.

The above information is used to check the data and the datatype of each
attribute.
The above summary shows how the data is spread for
the numerical variables. We can clearly see the minimum value, mean values, different
percentile values and maximum values.
Outliers are observations that lie far beyond the range of most of the data; in a box plot they
appear as points beyond the whiskers.
In the graph below we see the outliers present in the data.
Looking at the box plots above, it seems that 16 variables, i.e. Apps, Accept, Enroll,
Top10perc, F.Undergrad, P.Undergrad, Outstate, Room.Board, Books, Personal, PhD, Terminal,
S.F.Ratio, perc.alumni, Expend and Grad.Rate, have outliers present. Accordingly, the
outliers have been treated too.
There are no duplicate records in the dataset.

There are no missing values in the dataset as you can clearly see below.
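Checks for duplicates and missing values are typically one-liners in pandas. The sketch below uses a toy frame that deliberately contains one duplicate row and one missing value so that the counts are non-zero; the report's own dataset has neither.

```python
import pandas as pd

# Toy frame with one duplicated row and one missing value (made up)
df = pd.DataFrame({
    "Names": ["College A", "College B", "College B", "College C"],
    "Apps": [1000, 2000, 2000, None],
})

# Number of rows that exactly repeat an earlier row
n_duplicates = int(df.duplicated().sum())

# Number of missing values in each column
missing_per_column = df.isnull().sum()

# One possible clean-up: drop repeated rows, then rows with missing values
cleaned = df.drop_duplicates().dropna()

print("duplicates:", n_duplicates)
print(missing_per_column)
print(cleaned)
```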
The above graphs give a visual presentation of each variable, so that we can understand the
data much better.
UNIVARIATE ANALYSIS

The above graph shows the Apps variable; we can say that the Apps parameter is right
skewed.
MULTIVARIATE ANALYSIS

In the above plot scatter diagrams are plotted for all the numerical columns in the dataset.
A scatter plot is a visual representation of the degree of correlation between any two columns.
The pair plot function in seaborn makes it very easy to generate joint scatter plots for all the
columns in the data.

PCA (Principal Component Analysis)


As per the question, a principal component analysis was performed on the given data.
The below plot shows the correlation between the variables in the dataset.
0 to 0.35 is “Weak”
0.35 to 0.8/0.85 is “Moderate”
Greater than 0.8/0.85 is “Strong”
In the above plot we can check and see which variables have Weak, Moderate or Strong
correlation with each other.
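The Weak/Moderate/Strong bands above can be written as a small helper. This is only a sketch: the function name is hypothetical, the strong cutoff is taken as 0.8 (the text allows 0.8 or 0.85), a value of exactly 0.35 is assumed to count as Moderate, and the sign of the correlation is ignored.

```python
def correlation_strength(r, strong_cutoff=0.8):
    """Label a correlation coefficient using the bands described in the text."""
    magnitude = abs(r)  # assumption: strength is judged on the absolute value
    if magnitude > 1:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    if magnitude > strong_cutoff:
        return "Strong"
    if magnitude >= 0.35:  # assumption: the 0.35 boundary counts as Moderate
        return "Moderate"
    return "Weak"


# Made-up coefficients to show each band
for r in (0.2, -0.5, 0.8, 0.9):
    print(r, correlation_strength(r))
```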
According to this correlation we can analyze various aspects of the dataset.
The columns were changed to rows and the number of columns was brought down to 12.

The above graph is a visual presentation of all the 12 principal components.


The columns were changed to rows and the number of columns was brought down to 5.
We check how much the original features matter to each principal component. Here we are only
considering the absolute values.
We compare how the original features influence the various principal components; for this
we need the original scaled features.
Finally, we check for the presence of correlations among the principal components.
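The steps described above (scaling, extracting components, inspecting absolute loadings, and checking that the components are uncorrelated) can be sketched with NumPy alone. This is a minimal illustration on synthetic data, using an eigen-decomposition of the covariance matrix; it is not the report's own code, which may well have used a library PCA routine, and all variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 50 rows, 4 features, the first two strongly correlated
base = rng.normal(size=(50, 1))
X = np.hstack([
    base + 0.1 * rng.normal(size=(50, 1)),
    2 * base + 0.1 * rng.normal(size=(50, 1)),
    rng.normal(size=(50, 1)),
    rng.normal(size=(50, 1)),
])

# Scale each feature to zero mean and unit variance (PCA is scale-sensitive)
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Eigen-decomposition of the covariance matrix of the scaled features
cov = np.cov(X_scaled, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order; sort them descending
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Share of total variance explained by each component, and the running total
explained_ratio = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained_ratio)

# Absolute weight of each original feature (row) in each component (column)
abs_loadings = np.abs(eigenvectors)

# Project onto the first two components and check they are uncorrelated
scores = X_scaled @ eigenvectors[:, :2]
score_corr = np.corrcoef(scores, rowvar=False)

print("explained variance ratio:", np.round(explained_ratio, 3))
print("cumulative:", np.round(cumulative, 3))
print("correlation between PC1 and PC2:", round(float(score_corr[0, 1]), 6))
```

The cumulative ratio is what one would typically consult when deciding how many components to keep, and the near-zero correlation between component scores is the property the final check in the text looks for.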
