Tasks For Students

The document outlines a data wrangling and preprocessing scenario for customer segmentation in an e-commerce business. It details the dataset structure, including customer demographics and purchasing behavior, and provides tasks for handling missing values, data transformation, feature engineering, and data visualization. The goal is to identify high-value customers and optimize marketing strategies through various analytical techniques.

Uploaded by

raguammu38

Data Wrangling and Preprocessing

Scenario: Customer Segmentation for an E-Commerce Business


Business Challenge:

An e-commerce company wants to segment its customers based on purchasing
behavior, demographics, and engagement. The goal is to identify high-value
customers, understand different buyer personas, and optimize marketing strategies.

Dataset Overview:

The dataset contains the following columns:

● Customer_ID – Unique identifier for each customer
● Age – Customer's age
● Gender – Male/Female/Other
● Annual_Income – Yearly income of the customer
● Spending_Score – Score based on purchasing behavior (0-100)
● Purchase_Frequency – Number of orders placed per month
● Last_Transaction_Days – Days since last purchase
● Preferred_Category – Customer’s most frequently purchased product category

Tasks for Students

1️⃣ Handling Missing Values & Duplicates


●​ Identify and count missing values in the dataset.
●​ Remove rows with missing values OR fill missing numerical values using
the mean/median and categorical values with the mode.
●​ Drop duplicate entries if any exist.

2️⃣ Data Transformation: Scaling & Encoding


●​ Normalize Annual_Income and Spending_Score using Min-Max Scaling.
●​ Standardize Purchase_Frequency using Z-score Normalization.
●​ Convert Gender into numerical values using One-Hot Encoding.
●​ Label encode the Preferred_Category column.
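
As a reference for what the two scaling steps compute, the sketch below applies both formulas by hand with pandas. Min-Max Scaling maps x to (x - min) / (max - min), and Z-score Normalization maps x to (x - mean) / std. The sample values and the output column names (Income_MinMax, Frequency_Z) are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Illustrative values only; not taken from the real dataset.
sample = pd.DataFrame({
    "Annual_Income": [20000, 50000, 80000],
    "Purchase_Frequency": [1, 3, 5],
})

# Min-Max Scaling: (x - min) / (max - min), so results lie in [0, 1].
income = sample["Annual_Income"]
sample["Income_MinMax"] = (income - income.min()) / (income.max() - income.min())

# Z-score Normalization: (x - mean) / std.
# ddof=0 gives the population standard deviation.
freq = sample["Purchase_Frequency"]
sample["Frequency_Z"] = (freq - freq.mean()) / freq.std(ddof=0)

print(sample)
```

After scaling, the Min-Max column runs from 0 to 1, and the Z-score column has mean 0 and unit standard deviation.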

3️⃣ Feature Engineering


●​ Create a new feature: Define a Customer Loyalty Score based on
Spending_Score and Purchase_Frequency (e.g., High, Medium, Low).
●​ Binning: Group customers into different income levels (e.g., Low, Medium,
High).
●​ Create an engagement metric: Combine Last_Transaction_Days and
Purchase_Frequency to categorize customers as Active, Dormant, or
Churned.
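
One possible way to approach the three feature engineering tasks is sketched below. The task leaves the exact rules open, so every threshold here (loyalty cut points of 40 and 70, income cut points of 30,000 and 70,000, the 30-day and 90-day engagement limits) is an assumption chosen for illustration, as are the sample rows and the new column names.

```python
import numpy as np
import pandas as pd

# Illustrative rows only; column names follow the Dataset Overview.
df = pd.DataFrame({
    "Annual_Income": [18000, 42000, 95000, 60000],
    "Spending_Score": [15, 55, 90, 40],
    "Purchase_Frequency": [1, 4, 8, 2],
    "Last_Transaction_Days": [200, 20, 5, 75],
})

# Loyalty: average of Spending_Score (0-100) and Purchase_Frequency
# rescaled to 0-100. The cut points 40 and 70 are assumptions.
freq_scaled = 100 * df["Purchase_Frequency"] / df["Purchase_Frequency"].max()
loyalty_value = (df["Spending_Score"] + freq_scaled) / 2
df["Loyalty_Score"] = pd.cut(
    loyalty_value,
    bins=[-np.inf, 40, 70, np.inf],
    labels=["Low", "Medium", "High"],
)

# Income binning: the cut points 30,000 and 70,000 are assumptions.
df["Income_Level"] = pd.cut(
    df["Annual_Income"],
    bins=[-np.inf, 30000, 70000, np.inf],
    labels=["Low", "Medium", "High"],
)

# Engagement: recent and frequent buyers are Active, customers with no
# purchase in over 90 days are Churned, everyone else is Dormant.
is_active = (df["Last_Transaction_Days"] <= 30) & (df["Purchase_Frequency"] >= 2)
is_churned = df["Last_Transaction_Days"] > 90
df["Engagement_Status"] = np.select(
    [is_active, is_churned], ["Active", "Churned"], default="Dormant"
)

print(df)
```

In practice the cut points would be tuned to the real data, for example by using quantiles with pd.qcut instead of fixed values.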

4️⃣ Data Visualization


1. Univariate Analysis (Distribution of Individual Features)

● Age Distribution – Use a histogram or KDE plot to show the distribution
of customer ages.
●​ Annual Income & Spending Score – Use box plots to detect
income/spending score outliers.
●​ Preferred Category Count – Use a bar plot to visualize the frequency of
product categories.

2. Bivariate Analysis (Relationship Between Two Features)

● Income vs Spending Score – Use a scatter plot to see spending trends
across income levels.
●​ Gender vs Spending Score – Use a box plot to compare spending habits
across genders.
●​ Correlation Heatmap – Show relationships between numerical features
using a heatmap.
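
A possible sketch of the three bivariate plots follows. The sample rows are made up, and the correlation matrix is restricted to the numerical columns so that text columns such as Gender do not cause errors.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Illustrative rows only; column names follow the Dataset Overview.
df = pd.DataFrame({
    "Age": [22, 35, 41, 29, 58, 33],
    "Gender": ["Female", "Male", "Other", "Female", "Male", "Female"],
    "Annual_Income": [25000, 48000, 61000, 39000, 120000, 52000],
    "Spending_Score": [30, 65, 50, 80, 20, 55],
    "Purchase_Frequency": [1, 4, 3, 6, 1, 2],
})

numeric_cols = ["Age", "Annual_Income", "Spending_Score", "Purchase_Frequency"]
corr = df[numeric_cols].corr()  # Pearson correlation by default

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Scatter plot: spending trends across income levels.
sns.scatterplot(data=df, x="Annual_Income", y="Spending_Score", ax=axes[0])
axes[0].set_title("Income vs Spending Score")

# Box plot: spending score compared across genders.
sns.boxplot(data=df, x="Gender", y="Spending_Score", ax=axes[1])
axes[1].set_title("Gender vs Spending Score")

# Heatmap: pairwise correlations, annotated with their values.
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1, ax=axes[2])
axes[2].set_title("Correlation Heatmap")

fig.tight_layout()
```

Fixing vmin and vmax at -1 and 1 keeps the color scale symmetric, so positive and negative correlations of the same strength get equally intense colors.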

3. Multivariate Analysis

● Pair Plot – Use Seaborn’s pairplot to visualize multiple relationships in the
dataset.

4. Customer Segmentation Insights

●​ Loyalty Score Distribution – Use a bar plot to show the count of High,
Medium, and Low loyalty customers.
●​ Engagement Status – Use a pie chart to show the percentage of Active,
Dormant, and Churned customers.
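
The two segmentation charts might be produced as in the sketch below. It assumes the Loyalty_Score and Engagement_Status columns were already created during feature engineering; those column names and the sample labels are illustrative assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative labels only; in practice these come from feature engineering.
df = pd.DataFrame({
    "Loyalty_Score": ["High", "Low", "Medium", "Low", "High", "Medium", "Low"],
    "Engagement_Status": ["Active", "Churned", "Dormant", "Dormant",
                          "Active", "Active", "Churned"],
})

# Count each label and fix the display order of the categories.
loyalty_counts = (
    df["Loyalty_Score"].value_counts()
    .reindex(["High", "Medium", "Low"], fill_value=0)
)
engagement_counts = (
    df["Engagement_Status"].value_counts()
    .reindex(["Active", "Dormant", "Churned"], fill_value=0)
)

fig, (ax_bar, ax_pie) = plt.subplots(1, 2, figsize=(11, 4))

# Bar plot: number of customers per loyalty level.
ax_bar.bar(list(loyalty_counts.index), list(loyalty_counts.values))
ax_bar.set_title("Loyalty Score Distribution")
ax_bar.set_ylabel("Number of customers")

# Pie chart: share of customers per engagement status.
ax_pie.pie(
    list(engagement_counts.values),
    labels=list(engagement_counts.index),
    autopct="%1.1f%%",
)
ax_pie.set_title("Engagement Status")

fig.tight_layout()
```

Reindexing the counts keeps the categories in a meaningful order rather than the frequency order that value_counts returns.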

5. Interactive Customer Segmentation Analysis with Plotly

● Create an Interactive Scatter Plot of Annual Income vs Spending Score
● Color customers based on their Loyalty Score (High, Medium, Low)
●​ Use hover effects to display customer details
●​ Enhance visualization interactivity using Plotly Express

Answers:
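# Assumes df is the customer DataFrame described in the Dataset Overview

# Identify and count missing values
print("Missing Values Count:\n", df.isnull().sum())

# Option 1: Remove rows with missing values
df_dropped = df.dropna()
print("\nData after dropping missing values:\n", df_dropped)

# Option 2: Fill missing values
df_filled = df.copy()
numerical_cols = ['Age', 'Annual_Income', 'Spending_Score',
                  'Purchase_Frequency', 'Last_Transaction_Days']
categorical_cols = ['Gender', 'Preferred_Category']

for col in numerical_cols:
    # Fill numerical with mean (use .median() instead if the column has outliers)
    df_filled[col] = df_filled[col].fillna(df_filled[col].mean())

for col in categorical_cols:
    # Fill categorical with mode (the most frequent value)
    df_filled[col] = df_filled[col].fillna(df_filled[col].mode()[0])

print("\nData after filling missing values:\n", df_filled)

# Drop duplicate entries
df_no_duplicates = df_filled.drop_duplicates()
print("\nData after dropping duplicates:\n", df_no_duplicates)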

import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler

# Continue from the cleaned data (missing values filled, duplicates dropped)
df = df_no_duplicates.copy()

# Normalize Annual_Income & Spending_Score using Min-Max Scaling
minmax_scaler = MinMaxScaler()
df[['Annual_Income', 'Spending_Score']] = minmax_scaler.fit_transform(
    df[['Annual_Income', 'Spending_Score']]
)

# Standardize Purchase_Frequency using Z-score Normalization
standard_scaler = StandardScaler()
df[['Purchase_Frequency']] = standard_scaler.fit_transform(df[['Purchase_Frequency']])

# Convert Gender into numerical values using One-Hot Encoding
# drop_first=True avoids the dummy variable trap
df = pd.get_dummies(df, columns=['Gender'], drop_first=True)

# Label encode the Preferred_Category column
label_encoder = LabelEncoder()
df['Preferred_Category'] = label_encoder.fit_transform(df['Preferred_Category'])

# Display the processed DataFrame
print(df)
