Superstore Sales Analysis (anjaliassignmnet.ipynb)
First, we import all required libraries, then load the data, and then
perform visualizations on it.
anjaliassignmnet.ipynb
IMPORTING LIBRARIES
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
from sklearn.preprocessing import LabelEncoder
df=pd.read_excel('Sample - Superstore.xls')
df.head()
CHECKING ROWS AND COLUMNS OF DATA
df.shape
df=df.drop('Row ID',axis=1)
df.head()
NOTE: The data clearly covers the US only, so we can drop the
'Country' column, as no analysis depends on it.
df['Country'].value_counts()
#dropping Country column
df=df.drop('Country',axis=1)
df.head()
df['Category'].unique()
FIRST VISUALIZATION
plt.figure(figsize=(12,10))
df['Sub-Category'].value_counts().plot.pie(autopct="%1.1f%%")
plt.show()
NOTE: The highest profit is earned on Copiers, while sales of
Chairs and Phones are extremely high compared to other products.
Another interesting fact: people don't prefer to buy Tables and
Bookcases from the Superstore, so these sub-categories run at a loss.
SECOND
df.groupby('Sub-Category')[['Profit','Sales']].sum().plot.bar()
plt.title('Total Profit and Sales per Sub-Category')
plt.legend(['Profit','Sales'])
plt.show()
THIRD
plt.figure(figsize=(12,10))
df['Product Name'].value_counts().head(10).plot.pie(autopct="%1.1f%%")
plt.figure(figsize=(15,8))
sns.countplot(x="Sub-Category", hue="Region", data=df)
plt.show()
#the data has no Cost column, so we derive it from Sales and Profit first
df['Cost']=df['Sales']-df['Profit']
df['Profit %']=(df['Profit']/df['Cost'])*100
df['Customer ID'].nunique()
fig=plt.figure(figsize=(10,8))
ax = fig.add_subplot(111)
sns.countplot(x='Segment', data=df, ax=ax)
for p in ax.patches:
    ax.annotate('{:.0f}'.format(p.get_height()),
                (p.get_x()+0.15, p.get_height()+1))
plt.show()
#selecting the columns by name (positional iloc indices are fragile after the drops above)
df[['Order ID','Customer ID','Profit %']]
Let's find out some more details about each customer: the total number of
purchases, the products they purchased, the first purchase date, the last
purchase date, and the locations from which the customer placed orders.
#creating a function that aggregates customer and order info
def agg_customer(x):
    d = []
    d.append(x['Order ID'].count())
    d.append(x['Sales'].sum())
    d.append(x['Profit %'].mean())
    d.append(pd.to_datetime(x['Order Date']).min())
    d.append(pd.to_datetime(x['Order Date']).max())
    d.append(x['Product Name'].unique())
    d.append(x['City'].unique())
    return pd.Series(d, index=['#Purchases', 'Total_Sales',
                               'Average Profit % gained',
                               'First_Purchase_Date', 'Latest_Purchase_Date',
                               'Products Purchased', 'Location_Count'])
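As a quick check, the aggregation above can be exercised with `groupby().apply()` on a tiny hypothetical dataset; the column names mirror the Superstore data, but the values are made up for illustration:

```python
import pandas as pd

def agg_customer(x):
    # Same per-customer summary logic as the function defined above
    d = []
    d.append(x['Order ID'].count())
    d.append(x['Sales'].sum())
    d.append(x['Profit %'].mean())
    d.append(pd.to_datetime(x['Order Date']).min())
    d.append(pd.to_datetime(x['Order Date']).max())
    d.append(x['Product Name'].unique())
    d.append(x['City'].unique())
    return pd.Series(d, index=['#Purchases', 'Total_Sales',
                               'Average Profit % gained',
                               'First_Purchase_Date', 'Latest_Purchase_Date',
                               'Products Purchased', 'Location_Count'])

# Hypothetical mini-dataset standing in for the Superstore data
toy = pd.DataFrame({
    'Customer ID': ['C1', 'C1', 'C2'],
    'Order ID': ['O1', 'O2', 'O3'],
    'Sales': [100.0, 50.0, 200.0],
    'Profit %': [10.0, 20.0, 5.0],
    'Order Date': ['2017-01-05', '2017-03-10', '2017-02-01'],
    'Product Name': ['Chair', 'Desk', 'Phone'],
    'City': ['Austin', 'Austin', 'Boston'],
})

# One summary row per customer
summary = toy.groupby('Customer ID').apply(agg_customer)
print(summary)
```

On the real data, the same `df.groupby('Customer ID').apply(agg_customer)` call produces one summary row per customer.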
The complete code is available in the file below; double-click the icon to
open it (the extension is .ipynb).
MODELING.ipynb
IMPORTING LIBRARIES
import warnings
import itertools
import numpy as np
import matplotlib.pyplot as plt
warnings.filterwarnings("ignore")
plt.style.use('fivethirtyeight')
import pandas as pd
import statsmodels.api as sm
import matplotlib
matplotlib.rcParams['axes.labelsize'] = 14
matplotlib.rcParams['xtick.labelsize'] = 12
matplotlib.rcParams['ytick.labelsize'] = 12
matplotlib.rcParams['text.color'] = 'k'
There are several categories in the Superstore sales data; we start with
time series analysis and forecasting for furniture sales.
df = pd.read_excel('Sample - Superstore.xls')
furniture = df.loc[df['Category'] == 'Furniture'].copy()
Data Preprocessing
cols = ['Row ID', 'Order ID', 'Ship Date', 'Ship Mode', 'Customer ID',
'Customer Name', 'Segment', 'Country', 'City', 'State', 'Postal Code',
'Region', 'Product ID', 'Category', 'Sub-Category', 'Product Name',
'Quantity', 'Discount', 'Profit']
furniture.drop(cols, axis=1, inplace=True)
furniture = furniture.sort_values('Order Date')
furniture.isnull().sum()
# resample() needs a DatetimeIndex, so set 'Order Date' as the index first
furniture = furniture.set_index('Order Date')
y = furniture['Sales'].resample('MS').mean()
y['2016':]
y.plot(figsize=(15, 6))
plt.show()
Some distinguishable patterns appear when we plot the data. The time
series has a seasonal pattern: sales are always low at the beginning of
the year and high at the end. There is also an upward trend within any
single year, with a couple of low months in the middle. We can further
visualize the data using a method called time-series decomposition, which
splits a time series into three distinct components: trend, seasonality,
and noise.
The plot above clearly shows that furniture sales are unstable, with
obvious seasonality.
mod = sm.tsa.statespace.SARIMAX(y,
order=(1, 1, 1),
seasonal_order=(1, 1, 0, 12),
enforce_stationarity=False,
enforce_invertibility=False)
results = mod.fit()
print(results.summary().tables[1])
results.plot_diagnostics(figsize=(16, 8))
plt.show()
NOTE : our model diagnostics suggests that the model residuals
are near normally distributed.
Validating forecasts
pred = results.get_prediction(start=pd.to_datetime('2017-01-01'), dynamic=False)
pred_ci = pred.conf_int()
ax = y['2014':].plot(label='observed')
pred.predicted_mean.plot(ax=ax, label='One-step ahead Forecast', alpha=.7, figsize=(14, 7))
ax.fill_between(pred_ci.index,
pred_ci.iloc[:, 0],
pred_ci.iloc[:, 1], color='k', alpha=.2)
ax.set_xlabel('Date')
ax.set_ylabel('Furniture Sales')
plt.legend()
plt.show()