Assignment - Jupyter Notebook
Assignment
You have been provided with a dataset containing information about customers of an e-commerce company. The task is
to build a binary classification model using logistic regression to predict whether a customer will make a purchase or not
based on their demographic and browsing behavior data. The dataset consists of the following features:
email
address
avatar
time on app
time on website
length of membership
yearly amount spent
The target variable is: Purchase (binary: 1 if the customer made a purchase over $450, 0 otherwise)
Instructions:
1. Load the dataset and perform any necessary data preprocessing steps.
2. Split the data into training and testing sets (e.g., 80% training, 20% testing).
3. Train a logistic regression model using the training data.
4. Evaluate the model's performance on the testing data using appropriate evaluation metrics (e.g., accuracy, precision, recall, F1-score).
5. Provide a brief summary of the model's performance and any insights you gather from the results.
Note: You can use any programming language or machine learning libraries of your choice. The aim of this problem is to
assess your ability to quickly understand the problem, preprocess the data, build a logistic regression model, evaluate
its performance, and derive meaningful insights from the results within a limited timeframe.
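The target defined above (1 if the customer spent over $450, 0 otherwise) is not a column of the raw data, so it has to be derived. A minimal sketch of one way to do that, assuming the spend column is named "Yearly Amount Spent" as in the outputs further down; the `demo` frame is made up for illustration (450.00 is included only to show the boundary case) and stands in for the real CSV:

```python
import pandas as pd

# Small illustrative frame; only the column name comes from the notebook.
demo = pd.DataFrame({"Yearly Amount Spent": [420.74, 503.98, 450.00, 597.74]})

# 1.0 when the customer spent strictly more than $450, otherwise 0.0.
# Cast to float so the dtype matches the float64 "purchase" column in df.info().
demo["purchase"] = (demo["Yearly Amount Spent"] > 450).astype(float)

print(demo)
```

Because the label is computed directly from "Yearly Amount Spent", keeping that column as a model feature lets the model read the answer off the input, which is worth bearing in mind when interpreting the scores.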
In [340]:
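import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression  # model fitted later as lg_model
from sklearn.metrics import classification_report  # used in the evaluation cell
from imblearn.over_sampling import RandomOverSampler

%matplotlib inline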
In [341]:
df = pd.read_csv("scpl_folder/data.csv")
localhost:8888/notebooks/Desktop/Personal/code/Practice_ML/Assignment.ipynb# 1/10
7/28/23, 1:10 PM Assignment - Jupyter Notebook
In [342]:
df.head(10)
Out[342]:
   Email                             Address                                             Avatar  Time on App  Time on Website  Length of Membership  Yearly Amount Spent  C
4  acampbell@sanchez-velasquez.info  5791 Jessica Cove Mckinneyborough, OK 64460-7536    Wheat   11.45        37.58            2.59                  420.74
6  [email protected]                 9991 Macdonald Squares Vasquezborough, WY 73586...  Purple  10.97        36.61            2.87                  404.82
In [343]:
In [344]:
In [345]:
df.describe()
Out[345]:
Time on App Time on Website Length of Membership Yearly Amount Spent purchase
In [346]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Email 500 non-null object
1 Address 500 non-null object
2 Avatar 500 non-null object
3 Time on App 500 non-null float64
4 Time on Website 500 non-null float64
5 Length of Membership 500 non-null float64
6 Yearly Amount Spent 500 non-null float64
7 Clean_Address_Loc 500 non-null object
8 Clean_Address_County 500 non-null object
9 purchase 500 non-null float64
dtypes: float64(5), object(5)
memory usage: 39.2+ KB
In [347]:
df.corr(numeric_only=True)  # restrict to numeric columns; the object columns cannot be correlated
Out[347]:
                      Time on App  Time on Website  Length of Membership  Yearly Amount Spent  purchase
Length of Membership     0.029240        -0.047443              1.000000             0.809184  0.601839
In [289]:
df = df.drop(['Email','Address','Clean_Address_Loc'], axis=1)
df
Out[289]:
Avatar  Time on App  Time on Website  Length of Membership  Yearly Amount Spent  Clean_Address_County  purchase
In [290]:
In [291]:
In [292]:
df = vect(df,'Avatar')
df.columns =['Time on App', 'Time on Website','Length of Membership', 'Yearly Amount Spent','Cl
df = vect(df,'Clean_Address_County')
df.columns =['Time on App', 'Time on Website','Length of Membership', 'Yearly Amount Spent','pu
df.head()
0 1 2 3 4 5 6 7
0 10.16 37.76 4.78 521.24 AR 1.0 0 0
1 13.46 37.24 2.94 503.98 WY 1.0 0 0
2 12.01 36.53 4.71 576.48 IA 1.0 0 0
3 10.1 38.04 4.24 418.6 MO 0.0 0 0
4 11.45 37.58 2.59 420.74 OK 0.0 0 0
.. ... ... ... ... .. ... .. ..
495 12.94 36.73 4.56 544.41 UT 1.0 0 0
496 11.83 36.84 3.61 502.09 MI 1.0 0 0
497 11.68 38.72 3.59 463.59 MT 1.0 0 0
498 12.75 36.71 3.28 548.28 Bo 1.0 0 0
499 12.13 38.19 4.02 597.74 SC 1.0 0 0
Out[292]:
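The definition of `vect` is not visible in this export, so how it encodes the Avatar and Clean_Address_County columns is unknown. As an illustration only, one common way to turn a categorical column into numeric features is one-hot encoding with `pd.get_dummies`. The helper name `one_hot_encode` below is hypothetical and is not the notebook's `vect`; the three demo rows reuse values from the output above.

```python
import pandas as pd


def one_hot_encode(frame: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return a copy of `frame` where `column` is replaced by 0/1 indicator columns."""
    dummies = pd.get_dummies(frame[column], prefix=column, dtype=int)
    return pd.concat([frame.drop(columns=[column]), dummies], axis=1)


# Three rows borrowed from the printed output, for illustration.
demo = pd.DataFrame({
    "Time on App": [10.16, 13.46, 12.01],
    "Clean_Address_County": ["AR", "WY", "IA"],
})

encoded = one_hot_encode(demo, "Clean_Address_County")
print(encoded.columns.tolist())
```

With many distinct categories this produces one column per category, which can be a lot of sparse features for a 500-row dataset.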
In [294]:
train,test = np.split(df.sample(frac=1),[int(0.8*len(df))])
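The split above shuffles without a fixed seed and does not control the class ratio in each part. A possible alternative, shown here as a sketch rather than as what the notebook does, is scikit-learn's `train_test_split` with `stratify` and `random_state`. The `demo` frame is synthetic (an assumed 80/20 class balance) and only the column names echo the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for df: 100 rows, 80 purchasers and 20 non-purchasers.
demo = pd.DataFrame({
    "Time on App": rng.normal(12, 1, size=100),
    "purchase": np.repeat([1.0, 0.0], [80, 20]),
})

# 80/20 split that is reproducible and keeps the class ratio in both parts.
train, test = train_test_split(
    demo, test_size=0.2, stratify=demo["purchase"], random_state=42
)

print(train.shape, test.shape)
print(test["purchase"].value_counts())
```

Fixing `random_state` makes the reported metrics repeatable between runs, which the unseeded `df.sample(frac=1)` does not.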
In [295]:
train.shape, test.shape
Out[295]:
In [296]:
train = pd.DataFrame(train)
train.columns =df.columns
print(train.head())

test = pd.DataFrame(test)
test.columns =df.columns
print(test.head())
In [297]:
def column_to_move(df):
    # Move the "purchase" target column to position 8.
    purchase = df.pop("purchase")
    df.insert(8, "purchase", purchase)
    return df

train = column_to_move(train)
test = column_to_move(test)

train, test
Out[297]:
In [305]:
In [306]:
In [332]:
C:\Users\creat\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:458: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
Out[332]:
LogisticRegression()
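The ConvergenceWarning above recommends scaling the data or raising `max_iter`. The cell that fitted `lg_model` is not visible in this export, so the following is only a sketch of how both suggestions could be applied together, using a scikit-learn pipeline on synthetic data. The feature values and the label rule are assumptions made for the demo.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two synthetic features on very different scales (roughly "time" vs "dollars").
X = np.column_stack([rng.normal(12, 1, 200), rng.normal(500, 80, 200)])
y = (X[:, 1] > 450).astype(int)  # assumed label rule, for the demo only

# Scaling happens inside the pipeline, so the scaler is fitted on whatever
# data is passed to fit() and reused unchanged at predict time.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)

print("iterations used:", model[-1].n_iter_[0])
print("training accuracy:", model.score(X, y))
```

Putting the scaler in the pipeline also avoids fitting it on the test rows, since `fit` only ever sees the training split.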
In [333]:
y_pred = lg_model.predict(X_test)
In [334]:
In [335]:
print(classification_report(y_test, y_pred))
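The report's output did not survive the export. To make clear what its columns mean, here is a small worked example on made-up labels (not the notebook's `y_test` and `y_pred`) that computes the same metrics individually.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up labels: 4 true positives, 1 false negative, 1 false positive, 2 true negatives.
y_true = [1, 1, 1, 1, 0, 0, 1, 0]
y_hat = [1, 1, 0, 1, 0, 1, 1, 0]

precision = precision_score(y_true, y_hat)  # TP / (TP + FP) = 4 / 5
recall = recall_score(y_true, y_hat)        # TP / (TP + FN) = 4 / 5
f1 = f1_score(y_true, y_hat)                # harmonic mean of precision and recall
accuracy = accuracy_score(y_true, y_hat)    # (TP + TN) / total = 6 / 8

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

F1-score and accuracy are separate numbers: here F1 is 0.80 while accuracy is 0.75, so it helps to report them under their own names.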
Summary
1. Cleaned the address field to obtain comparable location data points, so that customers with similar behaviour can be
grouped; this added two columns, Clean_Address_Loc and Clean_Address_County.
2. Loaded the data and cleaned the column names.
3. Built the purchase column from the stated condition (1 if the yearly amount spent is over $450, 0 otherwise).
4. Reviewed basic statistics (describe and info) before cleaning and resampling the data.
5. Vectorized Clean_Address_County and Avatar so they can be used in the regression.
6. Split the data into training and testing sets as instructed (80:20).
7. Resampled (oversampled) the data to balance the classes, so that the model generalizes better.
8. Built a logistic regression model and used it to predict on the test data.
9. Reported the evaluation metrics (F1-score / accuracy: about 98%).
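The resampling cell (step 7) is not visible in this export; the imports suggest imblearn's `RandomOverSampler` was used. The sketch below shows the same idea, random oversampling of the minority class, but swaps in plain pandas sampling so that it has no extra dependency. The function name `oversample_minority` and the demo rows are hypothetical.

```python
import pandas as pd


def oversample_minority(frame: pd.DataFrame, target: str = "purchase",
                        random_state: int = 0) -> pd.DataFrame:
    """Duplicate random rows of the smaller classes until all classes are equal in size."""
    counts = frame[target].value_counts()
    majority_size = counts.max()
    parts = []
    for label, count in counts.items():
        group = frame[frame[target] == label]
        parts.append(group)
        if count < majority_size:
            # Sample with replacement to fill the gap to the majority class.
            extra = group.sample(n=majority_size - count, replace=True,
                                 random_state=random_state)
            parts.append(extra)
    # Shuffle so that the duplicated rows are not grouped at the end.
    balanced = pd.concat(parts).sample(frac=1, random_state=random_state)
    return balanced.reset_index(drop=True)


# Made-up training rows: six purchasers and two non-purchasers.
train_demo = pd.DataFrame({
    "Time on App": [12.1, 11.8, 13.0, 12.5, 12.9, 11.6, 10.1, 11.4],
    "purchase": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0],
})

balanced = oversample_minority(train_demo)
print(balanced["purchase"].value_counts())
```

Applying this to the training split only, after the split in step 6, keeps duplicated rows out of the test set.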
Insights
1. The more time a customer spends on the app and the longer their membership, the higher the probability that they will make a purchase.
2. The app is more effective than the website for purchase conversion.
3. Length of membership has the highest impact on purchase.
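Insights like these are usually backed by the correlation table or by the fitted model's coefficients. As a sketch of the second approach, the code below fits a logistic regression on standardized features and ranks the coefficients by magnitude. The data is synthetic: the label rule deliberately weights membership length most and ignores website time, so the printed ranking illustrates the method and says nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Synthetic features; only the column names come from the notebook.
features = pd.DataFrame({
    "Time on App": rng.normal(12, 1, n),
    "Time on Website": rng.normal(37, 1, n),
    "Length of Membership": rng.normal(3.5, 1, n),
})

# Assumed label rule for the demo: membership counts double, website not at all.
signal = (features["Time on App"] - 12) + 2.0 * (features["Length of Membership"] - 3.5)
y = (signal + rng.normal(0, 0.5, n) > 0).astype(int)

# Standardize first so that coefficient magnitudes are comparable across features.
X_scaled = StandardScaler().fit_transform(features)
clf = LogisticRegression(max_iter=1000).fit(X_scaled, y)

coefs = pd.Series(clf.coef_[0], index=features.columns)
coefs = coefs.sort_values(key=abs, ascending=False)
print(coefs)
```

Without standardization, a feature measured in large units would get a small coefficient regardless of its importance, so raw coefficients should not be compared directly.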