This document summarizes an analysis of sales data from a furniture retailer. It covers: 1. Importing libraries, loading the sales data, and visualizing it to analyze trends by product category, sub-category, region, and year. 2. Developing a predictive forecasting model for furniture sales: the time series is preprocessed, decomposed to analyze trend and seasonality, and a range of ARIMA models is fit to the data. 3. Identifying the best-performing model, SARIMAX(1,1,1)x(1,1,0,12), by its lowest AIC value; this model is then used to forecast future furniture sales.


PART II

First, we import all the required libraries, then load the data, and finally perform visualizations on it.

The Python code used for all of these visualizations and analyses is given below. The complete notebook is also attached; double-click the icon to open or save it (file extension: .ipynb).

anjaliassignmnet.ipynb

 IMPORTING LIBRARIES

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
from sklearn.preprocessing import LabelEncoder

 READING DATA FROM FILES

df=pd.read_excel('Sample - Superstore.xls')
df.head()
 CHECKING ROWS AND COLUMNS OF DATA

df.shape

 COLUMN NAMES OF TABLE


df.columns

 CHECKING DATA TYPES OF COLUMNS


df.dtypes

 CHECKING NULL VALUES


df.isnull().sum()

 Dropping Row ID column and assigning to df

df=df.drop('Row ID',axis=1)
df.head()
NOTE : Clearly the data is for the US only, so we can drop the
'Country' column as we don't need any analysis based on it.
df['Country'].value_counts()
#dropping Country column
df=df.drop('Country',axis=1)
df.head()

WE CAN ANALYZE THE DATA IN 3 DIFFERENT WAYS

1. PRODUCT LEVEL ANALYSIS


2. CUSTOMER LEVEL ANALYSIS
3. ORDER LEVEL ANALYSIS

df['Category'].unique()

#number of products in each category


df['Category'].value_counts()

#number of sub-categories the products are divided into


df['Sub-Category'].nunique()

#number of products in each sub-category


df['Sub-Category'].value_counts()

FIRST VISUALIZATION

plt.figure(figsize=(12,10))
df['Sub-Category'].value_counts().plot.pie(autopct="%1.1f%%")
plt.show()
NOTE: The highest profit is earned on Copiers, while the selling price for
Chairs and Phones is extremely high compared to other products.
Another interesting fact: people don't prefer to buy Tables and
Bookcases from the Superstore, hence these departments run at a loss.
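
To verify these observations numerically, we can rank the sub-categories by total profit (a small supplementary check, not part of the original notebook):

#ranking sub-categories by total profit
profit_by_subcat = df.groupby('Sub-Category')['Profit'].sum().sort_values(ascending=False)
print(profit_by_subcat.head())   #most profitable sub-categories
print(profit_by_subcat.tail())   #loss-making sub-categories such as Tables and Bookcases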
SECOND

df.groupby('Sub-Category')[['Profit','Sales']].sum().plot.bar()
plt.title('Total Profit and Sales per Sub-Category')
plt.show()
THIRD

DISTRIBUTION OF TOP 10 PRODUCTS.

plt.figure(figsize=(12,10))
df['Product Name'].value_counts().head(10).plot.pie(autopct="%1.1f%%")

NOTE : People residing in the western part of the US tend to order more from the Superstore.
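
A quick check of order counts per region (a supplementary line, not in the original notebook) backs this up:

#orders per region
df['Region'].value_counts()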
FOURTH

Count of Sub-Category orders, region-wise

plt.figure(figsize=(15,8))
sns.countplot(x="Sub-Category", hue="Region", data=df)
plt.show()

To understand the data better, let's create some new columns like Cost and Profit %.

df['Cost']=df['Sales']-df['Profit']
df['Cost'].head()

df['Profit %']=(df['Profit']/df['Cost'])*100

#Profit Percentage of first 5 product names


df.iloc[[0,1,2,3,4],[14,20]]
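
Since positional indexes are fragile, an equivalent selection by column name is safer (a supplementary variant; it assumes that, after the drops above, column 14 is 'Product Name' and column 20 is 'Profit %'):

#same selection of the first 5 rows, by column name
df[['Product Name','Profit %']].head()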
#Products with high Profit Percentage
df.sort_values(['Profit %','Product Name'],ascending=False).groupby('Profit %').head(5)

LET'S LOOK AT THE DATA AT THE CUSTOMER LEVEL

df['Customer ID'].nunique()

#Top 10 customers who order frequently


df_top10=df['Customer Name'].value_counts().head(10)
df_top10

fig=plt.figure(figsize=(10,8))
ax = fig.add_subplot(111)
sns.countplot(x='Segment', data=df, ax=ax)
for s in ax.patches:
    ax.annotate('{:.0f}'.format(s.get_height()), (s.get_x()+0.15, s.get_height()+1))
plt.show()

#Top 20 Customers who benefitted the store


sortedTop20 = df.sort_values(['Profit'], ascending=False).head(20)
fig = plt.figure(figsize=(12,8))
ax = fig.add_subplot(111)
p = sns.barplot(x='Customer Name', y='Profit', hue='State', palette='Set1', data=sortedTop20, ax=ax)
ax.set_title("Top 20 profitable Customers")
ax.set_xticklabels(p.get_xticklabels(), rotation=75)
plt.tight_layout()
plt.show()

Let's do some analysis with the order details of the data.

#number of unique orders


df['Order ID'].nunique()

#Calculating the time taken for an order to ship and converting the number of days to int format
df['Shipment Duration']=(pd.to_datetime(df['Ship Date'])-pd.to_datetime(df['Order Date'])).dt.days
df['Shipment Duration']

df.iloc[:,[0,3,21]]

Let's find out some more details about each customer, like the total
products purchased, the products they purchase, first purchase date, last
purchase date, and the locations from which the customer placed orders.

#creating a function that aggregates customer and order info
def agg_customer(x):
    d = []
    d.append(x['Order ID'].count())
    d.append(x['Sales'].sum())
    d.append(x['Profit %'].mean())
    d.append(pd.to_datetime(x['Order Date']).min())
    d.append(pd.to_datetime(x['Order Date']).max())
    d.append(x['Product Name'].unique())
    d.append(x['City'].unique())
    return pd.Series(d, index=['#Purchases','Total_Sales','Average Profit % gained',
                               'First_Purchase_Date','Latest_Purchase_Date',
                               'Products Purchased','Location_Count'])

#grouping based on Customer ID and applying the function we created above
df_agg = df.groupby('Customer ID').apply(agg_customer)
df_agg
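
For example (a supplementary usage sketch, not in the original notebook), the aggregated table can be sorted to surface the highest-spending customers:

#top 5 customers by total sales
df_agg.sort_values('Total_Sales', ascending=False).head()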

#extracting the year of order


df['order year']=df['Order Date'].dt.year
df['order year'].head()

#Calculating Profit % gained in each Sub-Category per year
fig=plt.figure(figsize=(16,8))
ax = fig.add_subplot(111)
sns.barplot(x='order year', y='Profit %', hue='Sub-Category', palette='Paired', data=df, ax=ax)
for o in ax.patches:
    ax.annotate('{:.0f}'.format(o.get_height()), (o.get_x()+0.15, o.get_height()+1))
plt.show()

NOTE : The store's sales have increased every year, resulting in a high profit margin by the end of 2017.
#Sales per year
df.groupby('order year')[['Sales','Profit %']].sum().plot.bar()
plt.title('Year wise Total Sales & % of profit gained')
Predictive forecasting model

The complete code is available in the file below; double-click the icon to open it (file extension: .ipynb).

MODELING.ipynb

IMPORTING LIBRARIES

import warnings
import itertools
import numpy as np
import matplotlib.pyplot as plt
warnings.filterwarnings("ignore")
plt.style.use('fivethirtyeight')
import pandas as pd
import statsmodels.api as sm
import matplotlib

matplotlib.rcParams['axes.labelsize'] = 14
matplotlib.rcParams['xtick.labelsize'] = 12
matplotlib.rcParams['ytick.labelsize'] = 12
matplotlib.rcParams['text.color'] = 'k'

There are several categories in the Superstore sales data; we start with
time series analysis and forecasting for furniture sales.

df = pd.read_excel('Sample - Superstore.xls')
furniture = df.loc[df['Category'] == 'Furniture']

We have a good four years of furniture sales data.

furniture['Order Date'].min(), furniture['Order Date'].max()

Data Preprocessing

This step includes removing columns we do not need, checking for missing
values, aggregating sales by date, and so on.

cols = ['Row ID', 'Order ID', 'Ship Date', 'Ship Mode', 'Customer ID',
'Customer Name', 'Segment', 'Country', 'City', 'State', 'Postal Code',
'Region', 'Product ID', 'Category', 'Sub-Category', 'Product Name',
'Quantity', 'Discount', 'Profit']
furniture.drop(cols, axis=1, inplace=True)
furniture = furniture.sort_values('Order Date')

furniture.isnull().sum()

furniture = furniture.groupby('Order Date')['Sales'].sum().reset_index()

Indexing with Time Series Data

furniture = furniture.set_index('Order Date')


furniture.index

Our current datetime data can be tricky to work with; therefore, we will
use the average daily sales value for each month instead, using the
start of each month as the timestamp.

y = furniture['Sales'].resample('MS').mean()

y['2016':]

Visualizing Furniture Sales Time Series Data

y.plot(figsize=(15, 6))
plt.show()

Some distinguishable patterns appear when we plot the data. The time
series has a seasonal pattern: sales are always low at the beginning of
the year and high at the end of the year. There is also an upward trend
within any single year, with a couple of low months in the middle of the
year. We can also visualize our data using a method called time-series
decomposition, which allows us to decompose the time series into three
distinct components: trend, seasonality, and noise.

from pylab import rcParams


rcParams['figure.figsize'] = 18, 8

decomposition = sm.tsa.seasonal_decompose(y, model='additive')


fig = decomposition.plot()
plt.show()

The plot above clearly shows that furniture sales are unstable, along
with their obvious seasonality.

Time series forecasting with ARIMA

We are going to apply one of the most commonly used methods for
time-series forecasting, known as ARIMA, which stands for
Autoregressive Integrated Moving Average. ARIMA models are denoted by
the notation ARIMA(p, d, q), where p is the autoregressive order, d is
the degree of differencing (which removes trend), and q is the
moving-average order (which models the noise). The seasonal variant,
SARIMA, adds a second set of parameters (P, D, Q, s) to account for
seasonality:
p = d = q = range(0, 2)
pdq = list(itertools.product(p, d, q))
seasonal_pdq = [(x[0], x[1], x[2], 12) for x in list(itertools.product(p, d, q))]

print('Examples of parameter combinations for Seasonal ARIMA...')


print('SARIMAX: {} x {}'.format(pdq[1], seasonal_pdq[1]))
print('SARIMAX: {} x {}'.format(pdq[1], seasonal_pdq[2]))
print('SARIMAX: {} x {}'.format(pdq[2], seasonal_pdq[3]))
print('SARIMAX: {} x {}'.format(pdq[2], seasonal_pdq[4]))
This step is parameter selection for our furniture sales ARIMA time
series model. Our goal here is to use a "grid search" to find the optimal
set of parameters that yields the best performance for our model.

for param in pdq:
    for param_seasonal in seasonal_pdq:
        try:
            mod = sm.tsa.statespace.SARIMAX(y,
                                            order=param,
                                            seasonal_order=param_seasonal,
                                            enforce_stationarity=False,
                                            enforce_invertibility=False)
            results = mod.fit()
            print('ARIMA{}x{}12 - AIC:{}'.format(param, param_seasonal, results.aic))
        except:
            continue
The above output suggests that SARIMAX(1, 1, 1)x(1, 1, 0, 12) yields the
lowest AIC value of 297.78. Therefore, we should consider this the
optimal option.
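
Rather than reading the best combination off the printed output, the grid search can also track it programmatically (a small supplementary refinement, not in the original notebook):

#tracking the lowest-AIC combination during the grid search
best_aic, best_order, best_seasonal = np.inf, None, None
for param in pdq:
    for param_seasonal in seasonal_pdq:
        try:
            mod = sm.tsa.statespace.SARIMAX(y, order=param,
                                            seasonal_order=param_seasonal,
                                            enforce_stationarity=False,
                                            enforce_invertibility=False)
            results = mod.fit()
            if results.aic < best_aic:
                best_aic, best_order, best_seasonal = results.aic, param, param_seasonal
        except Exception:
            continue
print('Best SARIMAX{}x{} - AIC: {:.2f}'.format(best_order, best_seasonal, best_aic))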

Fitting the ARIMA model

mod = sm.tsa.statespace.SARIMAX(y,
                                order=(1, 1, 1),
                                seasonal_order=(1, 1, 0, 12),
                                enforce_stationarity=False,
                                enforce_invertibility=False)

results = mod.fit()

print(results.summary().tables[1])

results.plot_diagnostics(figsize=(16, 8))
plt.show()
NOTE : The model diagnostics suggest that the model residuals are
nearly normally distributed.
Validating forecasts

To help us understand the accuracy of our forecasts, we compare the
predicted sales to the real sales of the time series, and we set the
forecasts to start at 2017-01-01 and run to the end of the data.

pred = results.get_prediction(start=pd.to_datetime('2017-01-01'), dynamic=False)
pred_ci = pred.conf_int()

ax = y['2014':].plot(label='observed')
pred.predicted_mean.plot(ax=ax, label='One-step ahead Forecast', alpha=.7, figsize=(14, 7))

ax.fill_between(pred_ci.index,
                pred_ci.iloc[:, 0],
                pred_ci.iloc[:, 1], color='k', alpha=.2)

ax.set_xlabel('Date')
ax.set_ylabel('Furniture Sales')
plt.legend()

plt.show()

NOTE : The line plot shows the observed values compared to the
one-step-ahead forecast predictions. Overall, our forecasts align
with the true values very well, showing an upward trend starting
from the beginning of the year and capturing the seasonality
toward the end of the year.
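
To quantify this fit (a supplementary check, not in the original notebook), we can compute the mean squared error of the one-step-ahead forecasts against the observed values:

#MSE and RMSE of the one-step-ahead forecasts
y_forecasted = pred.predicted_mean
y_truth = y['2017-01-01':]
mse = ((y_forecasted - y_truth) ** 2).mean()
print('MSE of the forecasts: {:.2f}'.format(mse))
print('RMSE of the forecasts: {:.2f}'.format(np.sqrt(mse)))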

Producing and visualizing forecasts


pred_uc = results.get_forecast(steps=100)
pred_ci = pred_uc.conf_int()

ax = y.plot(label='observed', figsize=(14, 7))


pred_uc.predicted_mean.plot(ax=ax, label='Forecast')
ax.fill_between(pred_ci.index,
                pred_ci.iloc[:, 0],
                pred_ci.iloc[:, 1], color='k', alpha=.25)
ax.set_xlabel('Date')
ax.set_ylabel('Furniture Sales')

plt.legend()
plt.show()

NOTE : Our model clearly captured the seasonality of furniture sales.
As we forecast further into the future, it is natural to become less
confident in our values. This is reflected in the confidence intervals
generated by our model, which grow larger as we move further into the
future.

The above time series analysis for furniture makes us curious about
other categories and how they compare with each other over time.
Therefore, we are going to compare the time series of furniture and
office supplies, as sketched below.
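
As a starting point for that comparison (a sketch applying the same preprocessing as for furniture; it is not part of the attached notebook), the office supplies series can be built and plotted alongside:

#building the Office Supplies monthly series with the same preprocessing
office = df.loc[df['Category'] == 'Office Supplies']
office = office.groupby('Order Date')['Sales'].sum().reset_index().set_index('Order Date')
y_office = office['Sales'].resample('MS').mean()

#plotting both categories for a visual comparison
y.plot(label='Furniture', figsize=(14, 7))
y_office.plot(label='Office Supplies')
plt.legend()
plt.show()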
