This document summarizes an analysis of sales data from a furniture retailer. It covers: 1. Importing libraries, loading the sales data, and visualizing it to analyze trends by product category, sub-category, region, and year. 2. Developing a predictive forecasting model for furniture sales: the time series is preprocessed, decomposed to analyze trend and seasonality, and a range of ARIMA models is fit to the data. 3. Identifying the best-performing model, SARIMAX(1,1,1)x(1,1,0,12), by its lowest AIC value; this model is then used to forecast future furniture sales.


PART II

First, we import all the required libraries, then load the data, and finally perform visualizations on it.

The Python code used for all of these visualizations and analyses is given below. The complete notebook is also attached; double-click the icon to open or save it (file extension: .ipynb).

anjaliassignmnet.ipynb

 IMPORTING LIBRARIES

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
from sklearn.preprocessing import LabelEncoder

 READING DATA FROM FILES

df=pd.read_excel('Sample - Superstore.xls')
df.head()
 CHECKING ROWS AND COLUMNS OF DATA

df.shape

 COLUMN NAMES OF TABLE


df.columns

 CHECKING DATA TYPES OF COLUMNS


df.dtypes

 CHECKING NULL VALUES


df.isnull().sum()

 Dropping Row ID column and assigning to df

df=df.drop('Row ID',axis=1)
df.head()
NOTE : Clearly the data is for the US only, so we can drop the
'Country' column as we don't need any analysis based on it.
df['Country'].value_counts()
#dropping Country column
df=df.drop('Country',axis=1)
df.head()

WE CAN ANALYZE THE DATA IN 3 DIFFERENT WAYS

1. PRODUCT LEVEL ANALYSIS


2. CUSTOMER LEVEL ANALYSIS
3. ORDER LEVEL ANALYSIS

df['Category'].unique()

#number of products in each category


df['Category'].value_counts()

#number of sub-categories the products are divided into


df['Sub-Category'].nunique()

#number of products in each sub-category


df['Sub-Category'].value_counts()

FIRST VISUALIZATION

plt.figure(figsize=(12,10))
df['Sub-Category'].value_counts().plot.pie(autopct="%1.1f%%")
plt.show()
NOTE: The highest profit is earned on Copiers, while the selling price for
Chairs and Phones is extremely high compared to other products.
Another interesting fact: people don't prefer to buy Tables and
Bookcases from the Superstore, hence these departments run at a loss.
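
To verify these observations numerically, we can rank the sub-categories by total profit (a small supplementary check, not part of the original notebook):

#ranking sub-categories by total profit
profit_by_subcat = df.groupby('Sub-Category')['Profit'].sum().sort_values(ascending=False)
print(profit_by_subcat.head())   #most profitable sub-categories
print(profit_by_subcat.tail())   #loss-making sub-categories such as Tables and Bookcases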
SECOND

df.groupby('Sub-Category')[['Profit','Sales']].sum().plot.bar()
plt.title('Total Profit and Sales per Sub-Category')
plt.show()
THIRD

DISTRIBUTION OF TOP 10 PRODUCTS.

plt.figure(figsize=(12,10))
df['Product Name'].value_counts().head(10).plot.pie(autopct="%1.1f%%")

NOTE : People residing in the western part of the US tend to order more from the Superstore.
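
A quick check of order counts per region (a supplementary line, not in the original notebook) backs this up:

#orders per region
df['Region'].value_counts()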
FOURTH

Count of Sub-Category orders, region-wise

plt.figure(figsize=(15,8))
sns.countplot(x="Sub-Category", hue="Region", data=df)
plt.show()

To understand the data better, let's create some new columns like Cost and Profit %.

df['Cost']=df['Sales']-df['Profit']
df['Cost'].head()

df['Profit %']=(df['Profit']/df['Cost'])*100

#Profit Percentage of first 5 product names


df.iloc[[0,1,2,3,4],[14,20]]
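
Since positional indexes are fragile, an equivalent selection by column name is safer (a supplementary variant; it assumes that, after the drops above, column 14 is 'Product Name' and column 20 is 'Profit %'):

#same selection of the first 5 rows, by column name
df[['Product Name','Profit %']].head()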
#Products with high Profit Percentage
df.sort_values(['Profit %','Product Name'],ascending=False).groupby('Profit %').head(5)

LET'S LOOK AT THE DATA AT THE CUSTOMER LEVEL

df['Customer ID'].nunique()

#Top 10 customers who order frequently


df_top10=df['Customer Name'].value_counts().head(10)
df_top10

fig=plt.figure(figsize=(10,8))
ax = fig.add_subplot(111)
sns.countplot(x='Segment', data=df, ax=ax)
for s in ax.patches:
    ax.annotate('{:.0f}'.format(s.get_height()), (s.get_x()+0.15, s.get_height()+1))
plt.show()

#Top 20 Customers who benefitted the store


sortedTop20 = df.sort_values(['Profit'], ascending=False).head(20)
fig = plt.figure(figsize=(12,8))
ax = fig.add_subplot(111)
p = sns.barplot(x='Customer Name', y='Profit', hue='State', palette='Set1', data=sortedTop20, ax=ax)
ax.set_title("Top 20 profitable Customers")
ax.set_xticklabels(p.get_xticklabels(), rotation=75)
plt.tight_layout()
plt.show()

Let's do some analysis with the order details of the data.

#number of unique orders


df['Order ID'].nunique()

#Calculating the time taken for an order to ship and converting the number of days to int format
df['Shipment Duration']=(pd.to_datetime(df['Ship Date'])-pd.to_datetime(df['Order Date'])).dt.days
df['Shipment Duration']

df.iloc[:,[0,3,21]]

Let's find out some more details about each customer, like the total
products purchased, the products they purchase, first purchase date, last
purchase date, and the locations from which the customer placed orders.

#creating a function that aggregates customer and order info
def agg_customer(x):
    d = []
    d.append(x['Order ID'].count())
    d.append(x['Sales'].sum())
    d.append(x['Profit %'].mean())
    d.append(pd.to_datetime(x['Order Date']).min())
    d.append(pd.to_datetime(x['Order Date']).max())
    d.append(x['Product Name'].unique())
    d.append(x['City'].unique())
    return pd.Series(d, index=['#Purchases','Total_Sales','Average Profit % gained',
                               'First_Purchase_Date','Latest_Purchase_Date',
                               'Products Purchased','Location_Count'])

#grouping based on Customer ID and applying the function we created above
df_agg = df.groupby('Customer ID').apply(agg_customer)
df_agg
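
For example (a supplementary usage sketch, not in the original notebook), the aggregated table can be sorted to surface the highest-spending customers:

#top 5 customers by total sales
df_agg.sort_values('Total_Sales', ascending=False).head()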

#extracting the year of order


df['order year']=df['Order Date'].dt.year
df['order year'].head()

#Calculating Profit % gained in each Sub-Category per year
fig=plt.figure(figsize=(16,8))
ax = fig.add_subplot(111)
sns.barplot(x='order year', y='Profit %', hue='Sub-Category', palette='Paired', data=df, ax=ax)
for o in ax.patches:
    ax.annotate('{:.0f}'.format(o.get_height()), (o.get_x()+0.15, o.get_height()+1))
plt.show()

NOTE : The store's sales have increased every year, resulting in a high profit margin by the end of 2017.
#Sales per year
df.groupby('order year')[['Sales','Profit %']].sum().plot.bar()
plt.title('Year wise Total Sales & % of profit gained')
Predictive forecasting model

The complete code is available in the file below; double-click the icon to open it (file extension: .ipynb).

MODELING.ipynb

IMPORTING LIBRARIES

import warnings
import itertools
import numpy as np
import matplotlib.pyplot as plt
warnings.filterwarnings("ignore")
plt.style.use('fivethirtyeight')
import pandas as pd
import statsmodels.api as sm
import matplotlib

matplotlib.rcParams['axes.labelsize'] = 14
matplotlib.rcParams['xtick.labelsize'] = 12
matplotlib.rcParams['ytick.labelsize'] = 12
matplotlib.rcParams['text.color'] = 'k'

There are several categories in the Superstore sales data; we start with
time series analysis and forecasting for furniture sales.

df = pd.read_excel('Sample - Superstore.xls')
furniture = df.loc[df['Category'] == 'Furniture']

We have a good four years of furniture sales data.

furniture['Order Date'].min(), furniture['Order Date'].max()

Data Preprocessing

This step includes removing columns we do not need, checking for missing
values, aggregating sales by date, and so on.

cols = ['Row ID', 'Order ID', 'Ship Date', 'Ship Mode', 'Customer ID',
'Customer Name', 'Segment', 'Country', 'City', 'State', 'Postal Code',
'Region', 'Product ID', 'Category', 'Sub-Category', 'Product Name',
'Quantity', 'Discount', 'Profit']
furniture.drop(cols, axis=1, inplace=True)
furniture = furniture.sort_values('Order Date')

furniture.isnull().sum()

furniture = furniture.groupby('Order Date')['Sales'].sum().reset_index()

Indexing with Time Series Data

furniture = furniture.set_index('Order Date')


furniture.index

Our current datetime data can be tricky to work with; therefore, we will
use the average daily sales value for each month instead, using the
start of each month as the timestamp.

y = furniture['Sales'].resample('MS').mean()

y['2016':]

Visualizing Furniture Sales Time Series Data

y.plot(figsize=(15, 6))
plt.show()

Some distinguishable patterns appear when we plot the data. The time
series has a seasonal pattern: sales are always low at the beginning of
the year and high at the end of the year. There is also an upward trend
within any single year, with a couple of low months in the middle of the
year. We can also visualize our data using a method called time-series
decomposition, which allows us to decompose the time series into three
distinct components: trend, seasonality, and noise.

from pylab import rcParams


rcParams['figure.figsize'] = 18, 8

decomposition = sm.tsa.seasonal_decompose(y, model='additive')


fig = decomposition.plot()
plt.show()

The plot above clearly shows that furniture sales are unstable, along
with their obvious seasonality.

Time series forecasting with ARIMA

We are going to apply one of the most commonly used methods for
time-series forecasting, known as ARIMA, which stands for
Autoregressive Integrated Moving Average. ARIMA models are denoted by
the notation ARIMA(p, d, q), where p is the autoregressive order, d is
the degree of differencing (which removes trend), and q is the
moving-average order (which models the noise). The seasonal variant,
SARIMA, adds a second set of parameters (P, D, Q, s) to account for
seasonality:
p = d = q = range(0, 2)
pdq = list(itertools.product(p, d, q))
seasonal_pdq = [(x[0], x[1], x[2], 12) for x in list(itertools.product(p, d, q))]

print('Examples of parameter combinations for Seasonal ARIMA...')


print('SARIMAX: {} x {}'.format(pdq[1], seasonal_pdq[1]))
print('SARIMAX: {} x {}'.format(pdq[1], seasonal_pdq[2]))
print('SARIMAX: {} x {}'.format(pdq[2], seasonal_pdq[3]))
print('SARIMAX: {} x {}'.format(pdq[2], seasonal_pdq[4]))
This step is parameter selection for our furniture sales ARIMA time
series model. Our goal here is to use a "grid search" to find the optimal
set of parameters that yields the best performance for our model.

for param in pdq:
    for param_seasonal in seasonal_pdq:
        try:
            mod = sm.tsa.statespace.SARIMAX(y,
                                            order=param,
                                            seasonal_order=param_seasonal,
                                            enforce_stationarity=False,
                                            enforce_invertibility=False)
            results = mod.fit()
            print('ARIMA{}x{}12 - AIC:{}'.format(param, param_seasonal, results.aic))
        except:
            continue
The above output suggests that SARIMAX(1, 1, 1)x(1, 1, 0, 12) yields the
lowest AIC value of 297.78. Therefore, we should consider this the
optimal option.
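
Rather than reading the best combination off the printed output, the grid search can also track it programmatically (a small supplementary refinement, not in the original notebook):

#tracking the lowest-AIC combination during the grid search
best_aic, best_order, best_seasonal = np.inf, None, None
for param in pdq:
    for param_seasonal in seasonal_pdq:
        try:
            mod = sm.tsa.statespace.SARIMAX(y, order=param,
                                            seasonal_order=param_seasonal,
                                            enforce_stationarity=False,
                                            enforce_invertibility=False)
            results = mod.fit()
            if results.aic < best_aic:
                best_aic, best_order, best_seasonal = results.aic, param, param_seasonal
        except Exception:
            continue
print('Best SARIMAX{}x{} - AIC: {:.2f}'.format(best_order, best_seasonal, best_aic))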

Fitting the ARIMA model

mod = sm.tsa.statespace.SARIMAX(y,
                                order=(1, 1, 1),
                                seasonal_order=(1, 1, 0, 12),
                                enforce_stationarity=False,
                                enforce_invertibility=False)

results = mod.fit()

print(results.summary().tables[1])

results.plot_diagnostics(figsize=(16, 8))
plt.show()
NOTE : The model diagnostics suggest that the model residuals are
nearly normally distributed.
Validating forecasts

To help us understand the accuracy of our forecasts, we compare the
predicted sales to the real sales of the time series, and we set the
forecasts to start at 2017-01-01 and run to the end of the data.

pred = results.get_prediction(start=pd.to_datetime('2017-01-01'), dynamic=False)
pred_ci = pred.conf_int()

ax = y['2014':].plot(label='observed')
pred.predicted_mean.plot(ax=ax, label='One-step ahead Forecast', alpha=.7, figsize=(14, 7))

ax.fill_between(pred_ci.index,
                pred_ci.iloc[:, 0],
                pred_ci.iloc[:, 1], color='k', alpha=.2)

ax.set_xlabel('Date')
ax.set_ylabel('Furniture Sales')
plt.legend()

plt.show()

NOTE : The line plot shows the observed values compared to the
one-step-ahead forecast predictions. Overall, our forecasts align
with the true values very well, showing an upward trend starting
from the beginning of the year and capturing the seasonality
toward the end of the year.
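
To quantify this fit (a supplementary check, not in the original notebook), we can compute the mean squared error of the one-step-ahead forecasts against the observed values:

#MSE and RMSE of the one-step-ahead forecasts
y_forecasted = pred.predicted_mean
y_truth = y['2017-01-01':]
mse = ((y_forecasted - y_truth) ** 2).mean()
print('MSE of the forecasts: {:.2f}'.format(mse))
print('RMSE of the forecasts: {:.2f}'.format(np.sqrt(mse)))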

Producing and visualizing forecasts


pred_uc = results.get_forecast(steps=100)
pred_ci = pred_uc.conf_int()

ax = y.plot(label='observed', figsize=(14, 7))


pred_uc.predicted_mean.plot(ax=ax, label='Forecast')
ax.fill_between(pred_ci.index,
                pred_ci.iloc[:, 0],
                pred_ci.iloc[:, 1], color='k', alpha=.25)
ax.set_xlabel('Date')
ax.set_ylabel('Furniture Sales')

plt.legend()
plt.show()

NOTE : Our model clearly captured the seasonality of furniture sales.
As we forecast further into the future, it is natural to become less
confident in our values. This is reflected in the confidence intervals
generated by our model, which grow larger as we move further into the
future.

The above time series analysis for furniture makes us curious about
other categories and how they compare with each other over time.
Therefore, we are going to compare the time series of furniture and
office supplies, as sketched below.
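
As a starting point for that comparison (a sketch applying the same preprocessing as for furniture; it is not part of the attached notebook), the office supplies series can be built and plotted alongside:

#building the Office Supplies monthly series with the same preprocessing
office = df.loc[df['Category'] == 'Office Supplies']
office = office.groupby('Order Date')['Sales'].sum().reset_index().set_index('Order Date')
y_office = office['Sales'].resample('MS').mean()

#plotting both categories for a visual comparison
y.plot(label='Furniture', figsize=(14, 7))
y_office.plot(label='Office Supplies')
plt.legend()
plt.show()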
