Data Science Tutorial 1686911993
1. ID : ID number of the customer.
2. Warehouse block : The company has a large warehouse that is divided into blocks such as
A, B, C, D, E.
3. Mode of shipment : The company ships products in multiple ways, such as Ship, Flight and
Road.
4. Customer care calls : The number of calls made to enquire about the shipment.
5. Customer rating : The rating the company received from each customer. 1 is the lowest (worst), 5 is the
highest (best).
6. Cost of the product : Cost of the Product in US Dollars.
7. Prior purchases : The number of prior purchases.
8. Product importance : The company has categorized each product's importance as
low, medium, or high.
9. Gender : Male and Female.
10. Discount offered : Discount offered on that specific product.
11. Weight in gms : The weight of the product in grams.
12. Reached on time : The target variable, where 1 indicates that the product has NOT reached on
time and 0 indicates that it has reached on time.
Data Pre-processing
Import library
In [53]: import numpy as np
import pandas as pd
#Visualization
import matplotlib.pyplot as plt
import matplotlib.style as style
style.use('fivethirtyeight') # use style fivethirtyeight
import seaborn as sns
from matplotlib import rcParams
import warnings
warnings.filterwarnings("ignore")
# Scaling
from sklearn.preprocessing import MinMaxScaler, StandardScaler
# Selection
from scipy.stats import chi2_contingency
# Evaluation metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import roc_curve, auc, confusion_matrix, classification_report
# Hyperparameter tuning
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
Import file
In [54]: df = pd.read_csv(r"C:\Users\HP\Downloads\Train.csv")
   ID  Warehouse_block  Mode_of_shipment  Customer_care_calls  Customer_rating  Cost_of_the_Product
0   1                D            Flight                    4                2                  177
1   2                F            Flight                    4                5                  216
2   3                A            Flight                    2                2                  183
3   4                B            Flight                    3                3                  176
4   5                C            Flight                    2                2                  184
Because the target's name is too long, we rename it to simplify the next steps.
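The renaming step could look like the sketch below, shown on a toy frame. The original target header `Reached.on.Time_Y.N` is an assumption (it is not shown in this notebook export); `is_late` is the short name that appears in the later outputs. Lower-casing the remaining headers is also assumed from the column listing further down.

```python
import pandas as pd

# Toy frame standing in for Train.csv; the original target header is an assumption.
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Warehouse_block": ["D", "F", "A"],
    "Reached.on.Time_Y.N": [1, 1, 0],
})

# Give the target a short name, then lower-case every header.
df = df.rename(columns={"Reached.on.Time_Y.N": "is_late"})
df.columns = df.columns.str.lower()

print(df.columns.tolist())
```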
Out[56]: (10999, 12)
In [60]: df.describe()
Out[64]:
id 0
warehouse_block 0
mode_of_shipment 0
customer_care_calls 0
customer_rating 0
cost_of_the_product 0
prior_purchases 0
product_importance 0
gender 0
discount_offered 0
weight_in_gms 0
is_late 0
dtype: int64
Identify outliers
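for col in numeric:
    # Define values
    nilai_min = df_dt[col].min()
    nilai_max = df_dt[col].max()
    Q1 = df_dt[col].quantile(0.25)
    Q3 = df_dt[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_lim = Q1 - (1.5*IQR)
    upper_lim = Q3 + (1.5*IQR)
    # Report whether any value falls outside the IQR limits
    if (nilai_min < lower_lim) or (nilai_max > upper_lim):
        print('Outlier is found in column',col,'\n')
    else:
        print('Outlier is not found in column',col,'\n')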
In [69]: # We handle outliers by replacing them with upper_bound or lower_bound
for col in ['prior_purchases', 'discount_offered']:
    # Initiate Q1
    Q1 = df_dt[col].quantile(0.25)
    # Initiate Q3
    Q3 = df_dt[col].quantile(0.75)
    # Initiate IQR
    IQR = Q3 - Q1
    # Initiate lower_bound & upper_bound
    lower_bound = Q1 - (IQR * 1.5)
    upper_bound = Q3 + (IQR * 1.5)
We did not remove the outliers; instead, we replaced them with the upper and lower bounds. As the
visualization above shows, no outliers are detected afterwards.
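One way to express this capping step is with `Series.clip`, sketched below on a toy series. The helper name `cap_outliers_iqr` and the sample values are hypothetical; the notebook's own replacement code is not shown in this export, so this is an illustration of the idea rather than the author's implementation.

```python
import pandas as pd


def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to the nearest bound."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# Toy data: the last value lies far above the upper bound.
toy = pd.Series([2, 3, 3, 4, 4, 5, 5, 6, 40], name="discount_offered")
capped = cap_outliers_iqr(toy)

print(capped.tolist())
```

Capping keeps every row, which matters here because dropping rows would also drop their target labels.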
In [77]: df_dt.describe()
category = ['warehouse_block','mode_of_shipment','product_importance',
            'gender','customer_rating']
chi2_check = []
# Iteration
for col in category:
    # If pvalue < 0.05
    if chi2_contingency(pd.crosstab(df_dt['is_late'], df_dt[col]))[1] < 0.05 :
        chi2_check.append('Reject Null Hypothesis')
    # If pvalue >= 0.05
    else :
        chi2_check.append('Fail to Reject Null Hypothesis')
From the result above, product_importance (in particular its high category) is associated with our target.
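To make the test concrete, here is a small sketch of a chi-square test of independence on made-up data (the counts are invented for illustration and are not the notebook's results). Note that the test assesses the feature as a whole; singling out one category would need a separate check, for example on a one-vs-rest table.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Invented toy data: late parcels are mostly "high", on-time parcels mostly "low".
toy = pd.DataFrame({
    "is_late": [1, 1, 1, 1, 0, 0, 0, 0] * 10,
    "product_importance": ["high", "high", "high", "low",
                           "low", "low", "low", "high"] * 10,
})

# Contingency table of target vs. feature, then the test itself.
table = pd.crosstab(toy["is_late"], toy["product_importance"])
chi2, p_value, dof, expected = chi2_contingency(table)

print(table)
print(f"chi2={chi2:.2f}, p={p_value:.4f}, dof={dof}")
```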
In [80]: # one hot encoding feature product_importance and keep high category
onehots = pd.get_dummies(df_dt['product_importance'], prefix = 'product_importance')
df_dt = df_dt.join(onehots)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10999 entries, 0 to 10998
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 customer_care_calls 10999 non-null float64
1 customer_rating 10999 non-null int64
2 cost_of_the_product 10999 non-null float64
3 prior_purchases 10999 non-null float64
4 discount_offered 10999 non-null float64
5 weight_in_gms 10999 non-null float64
6 is_late 10999 non-null int64
7 product_importance_high 10999 non-null uint8
dtypes: float64(5), int64(2), uint8(1)
memory usage: 612.4 KB
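The listing above keeps only `product_importance_high` out of the three dummy columns. A minimal sketch of how that could be done on a toy frame follows; the `drop` call and the cast to `int` are assumptions, since the corresponding notebook cell is not part of this export.

```python
import pandas as pd

toy = pd.DataFrame({"product_importance": ["low", "high", "medium", "low"]})

# One-hot encode, as in the cell above.
onehots = pd.get_dummies(toy["product_importance"], prefix="product_importance")
toy = toy.join(onehots)

# Assumed follow-up: keep only the "high" indicator and drop the rest.
toy = toy.drop(columns=["product_importance",
                        "product_importance_low",
                        "product_importance_medium"])
toy["product_importance_high"] = toy["product_importance_high"].astype(int)

print(toy)
```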
Target Visualization
In [82]: delay = pd.DataFrame(df_eda.groupby(['is_late'])['id'].count()/len(df_eda)).reset_ind
plt.pie(delay['id'],labels=delay['is_late'],autopct='%.2f%%');
Descriptive Statistic
Categorical values
plt.subplot(141);
sns.countplot(x=df_eda[col], palette = 'colorblind');
plt.title('Countplot')
plt.tight_layout();
plt.subplot(143);
df_eda[col].value_counts().plot.pie(autopct='%1.2f%%');
plt.title('Pie chart')
plt.legend()
Summary :
Numeric values
In [85]: df_eda[numeric].describe()
Summary :
The distributions of customer_care_calls, Customer_rating, Cost_of_the_Product, and Prior_purchases
look roughly symmetric, because the mean and the median are close, while discount_offered and
weight_in_gms appear skewed.
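The mean-versus-median comparison can be backed up with the sample skewness. The sketch below uses synthetic columns (a normal and an exponential draw, chosen only for illustration; they are not the notebook's data) to show how the two cases differ.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic stand-ins: one symmetric column, one right-skewed column.
toy = pd.DataFrame({
    "cost_of_the_product": rng.normal(200, 50, 5000),
    "discount_offered": rng.exponential(10, 5000),
})

# Mean close to median and skew near 0 suggest symmetry;
# mean well above median and a large positive skew suggest a right tail.
summary = pd.DataFrame({
    "mean": toy.mean(),
    "median": toy.median(),
    "skew": toy.skew(),
})

print(summary.round(2))
```

A mean close to the median indicates symmetry, which is necessary for normality but does not by itself establish it.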
Correlation Heatmap
In [86]: plt.figure(figsize=(7,6));
sns.heatmap(df_eda.corr(numeric_only=True), annot = True, fmt = '.2f', cmap = 'Reds');
1. Target is_late has a moderate positive correlation with discount_offered & weak negative correlation
with weight_in_gms.
2. Feature customer_care_calls has a weak positive correlation with cost_of_the_product and a
negative correlation with weight_in_gms.
3. Feature discount_offered has a moderate negative correlation with weight_in_gms.
Categorical - Categorical
Based on Gender
In [87]: i=1
plt.figure(figsize=(15,10))
for col in ['mode_of_shipment', 'warehouse_block', 'product_importance']:
    plt.subplot(2,2,i)
    sns.countplot(x=df_eda[col], hue=df_eda['gender'], palette="ch:.25")
    i+=1
Summary :
Female customers account for more parcels than male customers in every warehouse_block
except warehouse_block B.
In [88]: i=1
plt.figure(figsize=(15,10))
for col in ['mode_of_shipment', 'warehouse_block']:
    plt.subplot(2,2,i)
    sns.countplot(x=df_eda[col], hue=df_eda['product_importance'], palette="ch:.25")
    i+=1
Summary :
In [90]: i=1
plt.figure(figsize=(15,10))
for col in ['mode_of_shipment', 'warehouse_block', 'product_importance',
            'gender','customer_rating']:
    plt.subplot(2,3,i)
    sns.countplot(x=df_eda[col], hue=df_eda['is_late'], palette="ch:.25")
    i+=1
    plt.legend(['on_time','late']);
Summary :
Default Parameter
# Logistic Regression
lr_train_score, lr_test_score, lr_pr, lrtr_re, lrte_re, lr_f1, lr_auc = fit_evalu
lr, Xtrain, ytrain, Xtest, ytest)
# Decision Tree
dt_train_score, dt_test_score, dt_pr, dttr_re, dtte_re, dt_f1, dt_auc = fit_evalu
dt, Xtrain, ytrain, Xtest, ytest)
# Random Forest
rf_train_score, rf_test_score, rf_pr, rftr_re, rfte_re, rf_f1, rf_auc = fit_evalu
rf, Xtrain, ytrain, Xtest, ytest)
# KNN
knn_train_score, knn_test_score, knn_pr, knntr_re, knnte_re, knn_f1, knn_auc = fi
knn, Xtrain, ytrain, Xtest, ytest)
# SVC
svc_train_score, svc_test_score, svc_pr, svctr_re, svcte_re, svc_f1, svc_auc = fi
svc, Xtrain, ytrain, Xtest, ytest)
return model_comparison
In [95]: model_comparison_default(X,y)
Out[95]:
   Model                Accuracy_Train  Accuracy_Test  Precision  Recall_Train  Recall_Test  F1 Score  AUC
0  Logistic Regression  0.63            0.63           0.67       0.75          0.75         0.71      0.6
From the result above, only Logistic Regression and SVC are neither overfitting nor
underfitting. Logistic Regression has the highest recall. Let's see how the models perform with tuned parameters.
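The comparison table is built from a per-model evaluation helper whose body is cut off in this export. The sketch below shows one plausible shape for such a helper; the name `evaluate_model`, the synthetic data, and the choice to compute AUC from hard predictions are all assumptions for illustration, not the notebook's actual code.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split


def evaluate_model(model, Xtrain, ytrain, Xtest, ytest):
    """Fit a classifier and return train/test metrics as a dict (hypothetical helper)."""
    model.fit(Xtrain, ytrain)
    train_pred = model.predict(Xtrain)
    test_pred = model.predict(Xtest)
    return {
        "Accuracy_Train": accuracy_score(ytrain, train_pred),
        "Accuracy_Test": accuracy_score(ytest, test_pred),
        "Precision": precision_score(ytest, test_pred),
        "Recall_Train": recall_score(ytrain, train_pred),
        "Recall_Test": recall_score(ytest, test_pred),
        "F1 Score": f1_score(ytest, test_pred),
        # AUC from hard labels (assumption); probability scores are usually preferable.
        "AUC": roc_auc_score(ytest, test_pred),
    }


# Synthetic data standing in for the shipment features.
X, y = make_classification(n_samples=500, n_features=6, random_state=42)
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.3, random_state=42)

scores = evaluate_model(LogisticRegression(max_iter=1000), Xtrain, ytrain, Xtest, ytest)
print({name: round(value, 2) for name, value in scores.items()})
```

Comparing the train and test columns side by side is what supports the overfitting and underfitting statements: a large gap between them points to overfitting, while low values on both point to underfitting.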
Hyperparameter
Logistic Regression
# Initiate model
logres = LogisticRegression(random_state=42) # Init Logres with Gridsearch, cross va
model = RandomizedSearchCV(logres, hyperparameters, cv=5, random_state=42, scoring='
Out[96]: LogisticRegression(C=0.0001, random_state=42)
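Since the tuning cell above is truncated, here is a self-contained sketch of a randomized search for Logistic Regression scored on recall. The search space for `C`, the synthetic data, `n_iter=10`, and `max_iter=1000` are assumptions chosen for illustration; they are not taken from the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic data standing in for the shipment features.
X, y = make_classification(n_samples=400, n_features=6, random_state=42)
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.3, random_state=42)

# Assumed search space: regularization strength on a log scale.
hyperparameters = {"C": np.logspace(-4, 2, 20)}

search = RandomizedSearchCV(
    LogisticRegression(random_state=42, max_iter=1000),
    hyperparameters,
    n_iter=10,
    cv=5,
    scoring="recall",
    random_state=42,
)
search.fit(Xtrain, ytrain)

print(search.best_params_, round(search.best_score_, 3))
```

Optimizing recall alone can reward a model that labels every parcel as late, so it is worth checking precision and AUC for the selected estimator as well.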
Decision Tree
# Initiate hyperparameters
hyperparameters = dict(max_depth=max_depth,
min_samples_split=min_samples_split,
min_samples_leaf=min_samples_leaf,
max_features=max_features,
criterion = criterion,
splitter = splitter)
# Initiate model
dt_tun = DecisionTreeClassifier(random_state=42)
model = RandomizedSearchCV(dt_tun, hyperparameters, cv=10, scoring='recall',random_st
model.fit(Xtrain, ytrain)
y_pred_tun = model.predict(Xtest)
model.best_estimator_
Random Forest
Out[98]: RandomForestClassifier(max_depth=50, random_state=42)
KNN
#Convert to dictionary
hyperparameters = dict(leaf_size=leaf_size, n_neighbors=n_neighbors, p=p)
#Use RandomizedSearchCV
clf = RandomizedSearchCV(KNN_2, hyperparameters, cv=10, scoring = 'recall')
Out[99]: KNeighborsClassifier(leaf_size=36, n_neighbors=3)
SVC
#Convert to dictionary
hyperparameters = dict(kernel=kernel, C=C, gamma=gamma)
# Initiate model
svc = SVC(random_state=42)
model = RandomizedSearchCV(svc, hyperparameters, cv=5, random_state=42,
scoring='recall')
Tuned Parameter
# Logistic Regression
lr_train_score, lr_test_score, lr_pr, lrtr_re, lrte_re, lr_f1, lr_auc = fit_evalu
lr_tune, Xtrain, ytrain, Xtest, ytest)
# Decision Tree
dt_train_score, dt_test_score, dt_pr, dttr_re, dtte_re, dt_f1, dt_auc = fit_evalu
dt_tune, Xtrain, ytrain, Xtest, ytest)
# Random Forest
rf_train_score, rf_test_score, rf_pr, rftr_re, rfte_re, rf_f1, rf_auc = fit_evalu
rf_tune, Xtrain, ytrain, Xtest, ytest)
# KNN
knn_train_score, knn_test_score, knn_pr, knntr_re, knnte_re, knn_f1, knn_auc = fi
knn_tune, Xtrain, ytrain, Xtest, ytest)
# SVC
svc_train_score, svc_test_score, svc_pr, svctr_re, svcte_re, svc_f1, svc_auc = fi
svc_tune, Xtrain, ytrain, Xtest, ytest)
return model_comparison
In [104]: model_comparison_tuned(X,y)
Out[104]:
   Model                Accuracy_Train  Accuracy_Test  Precision  Recall_Train  Recall_Test  F1 Score  AUC
0  Logistic Regression  0.59            0.6            0.6        1.0           1.0          0.75      0.5
The Decision Tree algorithm with hyperparameter tuning shows a good balance across its scores and is neither
underfitting nor overfitting.
Confusion matrix
In [105]: from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
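The plotting code for this step is not included in the export. As a minimal sketch, the matrix can be computed and labelled as below; the toy label vectors are invented, and the `on_time` / `late` names follow the legend used in the earlier count plots.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Invented toy labels: 0 = on time, 1 = late.
ytest = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 1, 0, 1, 0]

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(ytest, y_pred, labels=[0, 1])
cm_df = pd.DataFrame(
    cm,
    index=["actual on_time", "actual late"],
    columns=["predicted on_time", "predicted late"],
)

print(cm_df)
```

Passing `cm_df` to `sns.heatmap(cm_df, annot=True, fmt='d')` would draw it as an annotated heatmap.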
Feature Importance
In [108]: feat_importances = pd.Series(dt_tune.feature_importances_, index=X.columns)
ax = feat_importances.nlargest(25).plot(kind='barh', figsize=(10, 8))
ax.invert_yaxis()
plt.xlabel('Score');
plt.ylabel('Feature');
plt.title('Feature Importance Score');
Recommendation for E-Commerce :
The operations team should add more manpower during sale programs, especially when the
discount is more than 10% and the parcel weighs 1 - 4 kg.
Parcels should not be concentrated in warehouse block F, so that handling does not become
overcrowded and cause late shipments.
Adding more features, such as delivery time estimation, delivery date, customer address, and
courier, could improve model performance.
Name of the Project- Product Shipment Delivered on time or not ?
By Pooja Keer
Batch- PGA19(THANE)
E-COMMERCE INDUSTRY-
"E-commerce" or "electronic commerce" is the trading of goods and services on the internet.
The e-commerce industry has seen significant growth in recent years, with more and
more people shopping online. As a result, there is demand for efficient and reliable
shipment services to deliver goods to customers in a timely manner.
One of the major challenges in e-commerce shipment is the management of the
delivery process. The shipment process involves multiple stages, from receiving the
order to packaging the goods and finally delivering them to the customer.
Each stage of the process must be carefully coordinated to ensure timely delivery and
minimize the risk of errors or delays.
E-COMMERCE SHIPPING PROCESS
The shipping process involves everything from receiving a
customer order to preparing it for last-mile delivery. The shipping
process can be broken down into three primary stages:
• Order receiving: make sure items are in stock to fulfill the order
• Order processing: verify order data and make sure it's accurate
(e.g., verifying the shipping address)
• Order fulfillment: a picking list is generated and items are picked,
packed, and prepared to be shipped
HOW DOES LATE DELIVERY AFFECT THE INDUSTRY?
◼ Customer dissatisfaction: Customers expect their orders to be delivered on time. If products are
not delivered on time, customers may become dissatisfied with the service and the e-commerce
platform. This can lead to negative reviews and decreased customer loyalty.
◼ Loss of sales: Delayed product delivery can lead to canceled orders, which can result in a loss of
sales for the e-commerce platform. Customers may also choose to shop with competitors who are
better at delivering products on time.
◼ Increased shipping costs: If products are not delivered on time, the e-commerce platform may
need to pay for expedited shipping or other additional costs to meet the delivery deadline. This can
increase the shipping costs, which can negatively impact the company's bottom line.
◼ Reputation damage: The timely delivery of products is critical to maintaining the e-commerce
platform's reputation. If products are not delivered on time, it can damage the company's
reputation and lead to decreased trust from customers.
WHAT COULD BE BENEFITS OF DELIVERING PRODUCT ON TIME ?
1. Data Preprocessing
2. Data Visualization
3. Model Fitting
4. Prediction
5. Hyperparameter Tuning
1.Data Pre-processing-
III) Product_importance has 3 unique values, and most products have low priority.
● After EDA of the numeric features, the distributions of customer_care_calls, Customer_rating, Cost_of_the_Product,
and Prior_purchases look roughly symmetric, because the mean and the median are close, while discount_offered and
weight_in_gms appear skewed.
● The Correlation Heatmap is obtained for studying the correlation with target variable as follows-
Based on the Correlation heatmap above :
1. Target is_late has a moderate positive correlation with discount_offered & weak negative correlation with weight_in_gms.
2. Feature customer_care_calls has a weak positive correlation with cost_of_the_product and a negative correlation
with weight_in_gms.
3. Feature discount_offered has a moderate negative correlation with weight_in_gms.
–The Decision Tree algorithm with hyperparameter tuning shows a good balance across its scores and is
neither underfitting nor overfitting.
–Confusion Matrix after parameter tuning-
7.Feature Importance-
● The operations team should add more manpower during sale programs, especially when
the discount is more than 10% and the parcel weighs 1 - 4 kg.
● Parcels should not be concentrated in warehouse block F, so that handling does not become
overcrowded and cause late shipments.
● Adding more features, such as delivery time estimation, delivery date, customer address,
and courier, could improve model performance.
CONCLUSION
◼ The timely delivery of products can also have a significant financial impact on the e-commerce
business. If products are delivered on time, it can lead to increased sales, repeat business, and revenue
growth. However, if products are consistently delivered late, it can lead to canceled orders, lost sales,
and increased shipping costs, which can negatively impact the company's bottom line.
◼ Late delivery of products can have legal consequences, particularly if there are contractual obligations
regarding delivery times. Failure to meet these obligations can result in breach of contract lawsuits,
financial penalties, and damage to the business reputation.
◼ Timely delivery of products is critical to maintaining a positive business reputation. If products are
consistently delivered on time, it can lead to a positive brand reputation and increased customer
trust. However, if products are frequently delivered late, it can damage the business reputation and
lead to decreased customer trust.
FURTHER ENHANCING MODEL
◼ The models applied to this business problem were machine-learning based. While maintaining accuracy, the
approach could be extended with deep learning models that may be useful for predicting whether a product
is delivered on time.
◼ Candidates include Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs), Long Short-Term Memory
(LSTM) networks, and Autoencoders.
◼ Time-series models are useful for analyzing data that changes over time, such as the frequency and timing
of deliveries. You could use time-series models to predict the delivery time based on historical data,
identifying patterns and trends that can help you make accurate predictions.
◼ RapidMiner can be useful for predicting the delivery of products on time. RapidMiner is a data science
platform that offers a variety of tools and techniques for data preprocessing, modeling, and evaluation. By
leveraging RapidMiner's predictive modeling capabilities, you can build and train a model that can predict
the likelihood of delivery delays based on various factors such as shipping method, location, product type,
and weather conditions.
THANK YOU