SMDM Project
Descriptive statistics is concerned with summarizing data through graphs, charts, and tables. The methods of
descriptive statistics include the distribution, which deals with each value's frequency, measures of central
tendency, and measures of variability. The most widely used measures of central tendency are the arithmetic
mean, median, and mode.
Mean is defined as the arithmetic average of all observations in the data set.
Median is defined as the middle value in the data set arranged in ascending or descending order.
Mode is defined as the most frequently occurring value in the distribution; it has the largest frequency.
Range is the simplest of all measures of dispersion. It is calculated as the difference between the maximum
and minimum values in the data set.
Inter-Quartile Range (IQR) is computed on the middle 50% of the observations, after eliminating the
highest and lowest 25% of observations in a data set that is arranged in ascending order. It is the difference
between the third and first quartiles, and it is less affected by outliers than the range.
Standard deviation is, in simple words, the square root of the variance.
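The measures defined above can be sketched with Python's standard `statistics` module. The values below are hypothetical and serve only to illustrate the definitions; the quartiles use the "inclusive" convention, which is one of several in common use.

```python
import statistics

# Hypothetical observations, used only to illustrate the definitions.
values = [2, 4, 4, 5, 7, 9, 11]

mean = statistics.mean(values)           # arithmetic average
median = statistics.median(values)       # middle value of the sorted data
mode = statistics.mode(values)           # most frequently occurring value
value_range = max(values) - min(values)  # maximum minus minimum

# First and third quartiles; the IQR is the spread of the middle 50%.
q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

variance = statistics.pvariance(values)  # population variance
std_dev = statistics.pstdev(values)      # square root of the variance

print(mean, median, mode, value_range, iqr, round(std_dev, 3))
```

The population forms (`pvariance`, `pstdev`) are an assumption here; for a sample, `statistics.variance` and `statistics.stdev` divide by n - 1 instead.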
1.2. There are 6 different varieties of items that are considered. Describe and
comment/explain all the varieties across Region and Channel? Provide a
detailed justification for your answer.
The above graph shows that, in the Retail channel, the highest
spend is on Grocery products and the lowest spend is on
Frozen food products. In the Hotel channel, the highest spend
is on Fresh products and the lowest spend is on
Detergents Paper.
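A comparison of this kind can be reproduced by totalling spend per item within each channel. The sketch below uses made-up records, not the report's data, and the tuple layout is an assumption chosen for illustration.

```python
from collections import defaultdict

# Hypothetical records: (channel, item, annual spend). Not the report's data.
records = [
    ("Retail", "Grocery", 120), ("Retail", "Frozen", 30), ("Retail", "Fresh", 80),
    ("Hotel", "Fresh", 150), ("Hotel", "Detergents_Paper", 10), ("Hotel", "Grocery", 40),
    ("Retail", "Grocery", 100), ("Hotel", "Fresh", 90),
]

# Sum the spend for every (channel, item) pair.
totals = defaultdict(lambda: defaultdict(int))
for channel, item, spend in records:
    totals[channel][item] += spend

# For each channel, pick the items with the highest and lowest total spend.
summary = {}
for channel, by_item in totals.items():
    top = max(by_item, key=by_item.get)
    bottom = min(by_item, key=by_item.get)
    summary[channel] = (top, bottom)

print(summary)
```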
1.3 On the basis of the descriptive measure of variability, which item shows
the most inconsistent behaviour? Which item shows the least inconsistent
behaviour?
This table shows that the coefficient of variation of Fresh
products is 105.25% while that of Delicatessen is 184.42%.
A higher coefficient of variation means less consistent spending.
Therefore, Delicatessen shows the most inconsistent
behaviour and Fresh products show the least inconsistent
behaviour.
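The coefficient of variation expresses the standard deviation as a percentage of the mean, so a higher value means less consistent spending. A minimal sketch, assuming the sample standard deviation and using hypothetical spend values rather than the report's data:

```python
import statistics

def coefficient_of_variation(values):
    """Return the sample standard deviation as a percentage of the mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100

# Hypothetical spends for two items with the same mean (illustrative only).
steady = [98, 100, 102, 100, 100]
erratic = [10, 250, 40, 180, 20]

cv_steady = coefficient_of_variation(steady)
cv_erratic = coefficient_of_variation(erratic)

# The item with the larger coefficient is the less consistent one.
print(round(cv_steady, 2), round(cv_erratic, 2))
```

Because both lists have the same mean, the difference in the coefficient comes entirely from the spread of the values.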
This pair plot helps us to understand the relationships between
the six items.
1.4 Are there any outliers in the data? Back up your answer with a suitable
plot/technique with the help of detailed comments.
From the boxplot below, we can clearly see that all six items have outliers.
Outliers are observations in a dataset that do not fit with the rest in some way. Perhaps the most common
or familiar type of outlier is an observation that is far from
the rest of the observations or the centre of mass of
observations. Outliers can skew statistical measures and
data distributions, providing a misleading representation of
the underlying data and relationships. Removing outliers from
data prior to modelling can result in a better fit of the
data and, in turn, more skilful predictions.
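A box plot conventionally flags points that lie more than 1.5 times the IQR beyond the quartiles. The sketch below applies that rule directly; the multiplier `k`, the "inclusive" quartile convention, and the sample values are assumptions for illustration.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical spend values with one observation far from the rest.
spend = [12, 15, 14, 13, 16, 15, 14, 95]
outliers = iqr_outliers(spend)
print(outliers)
```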
1.5 On the basis of your analysis, what are your recommendations for the
business? How can your analysis help the business to solve its problem?
Answer from the business perspective.
From this analysis we can conclude that:
(1) In total, the business spends the most
on Fresh products across different channels and different
regions, so the company needs to ensure that it is driving the
most profit from this food item.
(2) Since Fresh products have the lowest coefficient of variation
(105.25%, against 184.42% for Delicatessen), they show the least
inconsistent behaviour; the business should invest more in this food item
because it is less risky.
(3) Fresh products require more spending; to cut its costs, the
wholesale distributor can concentrate more on other food
items like Milk, Grocery, Frozen, Detergents Paper and
Delicatessen.
BUSINESS REPORT ON EDUCATION – POST 12TH STANDARD
CONTEXT:
The dataset Education - Post 12th Standard.csv contains information on various colleges.
You are expected to do a Principal Component Analysis for this case study according to the
instructions given.
Removing duplicates
Outlier Treatment
Univariate Analysis
Bivariate Analysis
As a first step, we import all the libraries that we expect to need to
perform the EDA.
In this step, we perform the operations below to check what the data set comprises.
We will check the below things:
In the table below we can see some sample records, which have 1
categorical variable and 17 numerical variables.
The above output is used to check information about the data and the data type of each
attribute.
The above output of the describe method helps us see how the data is spread for
the numerical variables. We can clearly see the minimum, mean, different
percentile and maximum values.
Outliers are values that lie beyond the expected range of the data, for example beyond the whiskers of a box plot.
In the graph below we see outliers present in the data.
Looking at the box plots above, it seems that 16 variables, i.e. Apps, Accept, Enroll,
Top10perc, F.Undergrad, P.Undergrad, Outstate, Room.Board, Books, Personal, PhD, Terminal,
S.F.Ratio, perc.alumni, Expend and Grad.Rate, have outliers present. The outliers have been
treated accordingly.
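The report does not state which treatment was applied. One common option, shown here purely as an assumption, is to cap values at the 1.5 x IQR fences so that extreme observations are pulled in rather than dropped. The function name and the sample values are hypothetical.

```python
import statistics

def cap_outliers(values, k=1.5):
    """Clip values to the interval [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, lower), upper) for v in values]

# Hypothetical application counts with one extreme college.
apps = [300, 450, 500, 520, 610, 700, 5200]
treated = cap_outliers(apps)
print(treated)
```

Capping keeps the number of records unchanged, which matters when every college must remain in the later analysis.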
There are no duplicate records in the dataset.
There are no missing values in the dataset as you can clearly see below.
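Checks of this kind are typically done with pandas. The sketch below uses a small made-up frame in place of the college data; the `Names` column and all values are hypothetical.

```python
import pandas as pd

# Small made-up frame standing in for the college data (values are hypothetical).
df = pd.DataFrame({
    "Names": ["College A", "College B", "College C"],
    "Apps": [1660, 2186, 1428],
    "Accept": [1232, 1924, 1097],
})

# Number of rows that exactly repeat an earlier row.
duplicate_rows = int(df.duplicated().sum())

# Count of missing values in each column.
missing_per_column = df.isnull().sum()

print(duplicate_rows)
print(missing_per_column)
```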
The above graphs give a visual presentation of each variable, so that we can understand the
data much better.
UNIVARIATE ANALYSIS
The above graph shows the distribution of the Apps variable. We can say that Apps is
right skewed, with a long tail towards higher values.
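The direction of skew can also be checked numerically: a positive skewness coefficient indicates a right-skewed distribution and a negative one a left-skewed distribution. A minimal sketch of the moment coefficient of skewness (population form), with hypothetical values:

```python
import statistics

def skewness(values):
    """Moment coefficient of skewness (population form)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    n = len(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / n

# Hypothetical data: one list with a long right tail, one symmetric list.
right_tailed = [1, 2, 2, 3, 3, 3, 4, 20]
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_tailed) > 0)
print(abs(skewness(symmetric)) < 1e-12)
```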
MULTIVARIATE ANALYSIS
In the above plot scatter diagrams are plotted for all the numerical columns in the dataset.
A scatter plot is a visual representation of the degree of correlation between any two columns.
The pair plot function in seaborn makes it very easy to generate joint scatter plots for all the
columns in the data.
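The strength of the linear relationship seen in any one scatter panel can be summarized by the Pearson correlation coefficient. The sketch below computes it from first principles; the two lists are hypothetical and only borrow the column names Apps and Accept.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values for two columns that rise together.
apps = [100, 200, 300, 400]
accept = [80, 150, 240, 310]

r = pearson(apps, accept)
print(round(r, 3))
```

Values close to +1 or -1 correspond to scatter panels where the points fall near a straight line, while values near 0 correspond to a shapeless cloud.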