Phase 2
After completing the initial data exploration in Phase 1, Phase 2 focuses on preparing the dataset
for deep learning. This involves cleaning, transforming, and scaling the data to ensure it is
suitable for training the deep autoencoder model. The primary goal of this phase is to handle
missing values, outliers, and data inconsistencies, and to apply appropriate transformations such
as feature scaling, encoding, and dimensionality reduction.
Cleaning the dataset is a critical step to ensure that the input data is accurate and ready for
modeling. In this phase, we address the following issues (a consolidated code sketch of these
cleaning steps follows the list):
Missing Values: Missing values can cause models to fail or produce biased results. For
this project, missing data were identified using descriptive statistics (per-feature counts of
missing entries) and visualization techniques such as missing-value heatmaps. The following
strategies were employed to handle missing values:
o Numerical Features: If a numerical feature had missing values, they were
imputed using the mean (if the data was approximately normal) or median (if the
data was skewed) to avoid distortion from extreme values.
o Categorical Features: Missing categorical values were imputed using the mode
(the most frequent value) to ensure consistency in categorical distributions.
Outliers: Outliers can significantly skew model results, especially for algorithms that are
sensitive to extreme values. For this project, outliers were detected using visualization
methods such as boxplots and statistical methods like the Z-score. Once identified,
extreme values were either:
o Capped: Outliers were limited to a maximum or minimum threshold (winsorization).
o Removed: For features with extreme outliers that significantly deviated from the
overall distribution, those records were removed from the dataset to avoid model
bias.
Inconsistencies: Inconsistencies within the data, such as duplicate entries or
contradictory field combinations (e.g., an implausible pairing of age and income), were
also cleaned. Duplicate rows were identified and removed, and contradictory entries were
flagged for further review or correction.
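A minimal consolidated sketch of these cleaning steps is shown below, assuming the data is held
in a pandas DataFrame. The skewness cutoff of 1, the Z-score cutoff of 3, and the 1st/99th-percentile
caps are illustrative assumptions rather than values taken from the project.

    import numpy as np
    import pandas as pd
    from scipy import stats

    def clean_dataset(df: pd.DataFrame) -> pd.DataFrame:
        """Illustrative cleaning pipeline: imputation, outlier capping, deduplication."""
        df = df.copy()

        # 1. Impute missing values: mean for roughly symmetric numerical features,
        #    median for skewed ones, mode for categorical features.
        for col in df.select_dtypes(include=[np.number]).columns:
            fill = df[col].mean() if abs(df[col].skew()) < 1 else df[col].median()
            df[col] = df[col].fillna(fill)
        for col in df.select_dtypes(include=["object", "category"]).columns:
            df[col] = df[col].fillna(df[col].mode().iloc[0])

        # 2. Cap (winsorize) numerical outliers flagged by |Z-score| > 3 at the
        #    1st/99th percentiles; the most extreme records could instead be
        #    dropped, as described above.
        for col in df.select_dtypes(include=[np.number]).columns:
            z = np.abs(stats.zscore(df[col]))
            lower, upper = df[col].quantile([0.01, 0.99])
            df.loc[z > 3, col] = df.loc[z > 3, col].clip(lower, upper)

        # 3. Remove exact duplicate rows.
        return df.drop_duplicates().reset_index(drop=True)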
Scaling the features ensures that they are comparable in magnitude, which is particularly
important for deep learning models like autoencoders. For this project, numerical features were
normalized to the [0, 1] range so that they match the sigmoid output range of the autoencoder's
reconstruction layer; the normalization and scaling steps are illustrated in the code sketch below.
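A minimal sketch of the scaling step, assuming scikit-learn's MinMaxScaler and the cleaned
DataFrame (here named df_clean) produced by the cleaning step above:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    # df_clean is the cleaned DataFrame from the previous step (assumed name).
    numeric_cols = df_clean.select_dtypes(include=[np.number]).columns

    # In practice the scaler would be fitted on the training split only to avoid
    # leakage; the whole frame is scaled here for brevity.
    scaler = MinMaxScaler(feature_range=(0, 1))
    df_scaled = df_clean.copy()
    df_scaled[numeric_cols] = scaler.fit_transform(df_clean[numeric_cols])

    # All numerical features now lie in [0, 1], matching the sigmoid output range
    # of the autoencoder's reconstruction layer.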
Transforming features helps improve the performance of the deep learning model by reducing
noise or irrelevant information and highlighting important patterns. This phase also includes
applying dimensionality reduction techniques to handle high-dimensional data; a code sketch of
these transformations follows the list below.
Categorical Encoding: Categorical features were converted into numerical form using one-hot
encoding. For example, a feature "Region" with three categories (North, South, East) would be
encoded as three binary columns: "Region_North", "Region_South", and "Region_East." The
values for each column would be 1 or 0, depending on which region the customer belongs to.
Dimensionality Reduction: Given the high number of features in the dataset, it was
important to reduce the dimensionality to speed up the training process and prevent
overfitting. Several dimensionality reduction techniques were considered:
o Principal Component Analysis (PCA): PCA was applied to reduce the
dimensionality of the dataset while retaining the maximum amount of variance.
This helped remove multicollinearity between highly correlated features and
provided a more compact representation of the data.
After PCA, the dataset's dimensionality was reduced to a smaller set of principal
components, which still captured most of the underlying patterns but with fewer
features. For instance, instead of using 30 original features, we reduced them to
10 principal components.
o Feature Selection: Based on the initial analysis, redundant features that provided
minimal value to the clustering process were removed. This was done by
analyzing feature importance or using techniques like Variance Thresholding,
which eliminates features with low variance across samples.
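The sketch below shows one possible ordering of these transformations with pandas and
scikit-learn. The scaled DataFrame name (df_scaled), the variance threshold of 0.01, and the
presence of a "Region"-style categorical column are illustrative assumptions; the reduction to
10 principal components follows the description above.

    import numpy as np
    import pandas as pd
    from sklearn.decomposition import PCA
    from sklearn.feature_selection import VarianceThreshold

    # 1. One-hot encode categorical features (e.g. an illustrative "Region" column).
    categorical_cols = df_scaled.select_dtypes(include=["object", "category"]).columns
    df_encoded = pd.get_dummies(df_scaled, columns=list(categorical_cols), dtype=float)

    # 2. Drop near-constant features; the 0.01 threshold is an illustrative choice.
    selector = VarianceThreshold(threshold=0.01)
    X_selected = selector.fit_transform(df_encoded)

    # 3. Project onto 10 principal components, as described above.
    pca = PCA(n_components=10)
    X_reduced = pca.fit_transform(X_selected)
    print("Variance retained:", pca.explained_variance_ratio_.sum())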
With the data cleaned, scaled, and transformed, the next step was to design the deep
autoencoder; a code sketch of the full model follows this list.
Encoder Architecture: The encoder takes the preprocessed and transformed data as
input and compresses it into a latent feature space. The encoder has the following layers:
o An input layer that takes the feature vector.
o Several dense layers with progressively decreasing units, such as 64 neurons in
the first hidden layer, followed by 32 neurons, and the final latent layer with 8
neurons.
Decoder Architecture: The decoder mirrors the encoder, expanding the latent features
back into the original feature space:
o Dense layers with progressively increasing units to reconstruct the original input.
o The output layer uses the sigmoid activation function to ensure the reconstructed
values are between 0 and 1, which matches the scaled features.
Loss Function and Optimizer: The model uses Mean Squared Error (MSE) as the
loss function since the goal is to minimize the reconstruction error between the input data
and the model's reconstruction. The Adam optimizer was chosen for efficient gradient-
based optimization.
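A minimal sketch of this architecture is given below. The report does not name the deep learning
framework, so Keras is used here as an assumption; the input dimensionality of 10 (the number of
principal components) and the ReLU activations in the hidden and latent layers are also
assumptions, while the 64-32-8 layer sizes, the sigmoid output, the MSE loss, and the Adam
optimizer follow the description above.

    from tensorflow.keras import layers, models

    input_dim = 10   # assumed: number of features after PCA
    latent_dim = 8

    # Encoder: progressively compress the input into an 8-dimensional latent space.
    encoder_input = layers.Input(shape=(input_dim,), name="encoder_input")
    x = layers.Dense(64, activation="relu")(encoder_input)
    x = layers.Dense(32, activation="relu")(x)
    latent = layers.Dense(latent_dim, activation="relu", name="latent")(x)
    encoder = models.Model(encoder_input, latent, name="encoder")

    # Decoder: mirror the encoder and reconstruct the original feature space.
    decoder_input = layers.Input(shape=(latent_dim,), name="decoder_input")
    x = layers.Dense(32, activation="relu")(decoder_input)
    x = layers.Dense(64, activation="relu")(x)
    decoder_output = layers.Dense(input_dim, activation="sigmoid")(x)
    decoder = models.Model(decoder_input, decoder_output, name="decoder")

    # Full autoencoder: reconstruction error is measured with MSE and minimized
    # with the Adam optimizer.
    autoencoder = models.Model(encoder_input, decoder(encoder(encoder_input)),
                               name="autoencoder")
    autoencoder.compile(optimizer="adam", loss="mse")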
After designing the autoencoder architecture, the next step was training the model. The data was
split into training and validation sets to evaluate the model's performance. The model was
trained for 50 epochs with a batch size of 32.
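The training step could look as follows; the 80/20 split ratio and the random seed are
assumptions, while the 50 epochs and batch size of 32 follow the description above (X_reduced
and autoencoder refer to the earlier sketches).

    from sklearn.model_selection import train_test_split

    X_train, X_val = train_test_split(X_reduced, test_size=0.2, random_state=42)

    # The autoencoder reconstructs its own input, so the input is also the target.
    history = autoencoder.fit(
        X_train, X_train,
        validation_data=(X_val, X_val),
        epochs=50,
        batch_size=32,
        shuffle=True,
    )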
During training, the model’s reconstruction error was monitored on both the training and
validation datasets to ensure that the model was generalizing well and not overfitting.
Hyperparameters, such as the learning rate and batch size, were fine-tuned to achieve optimal
performance.
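One simple way to monitor the reconstruction error on both sets is to plot the per-epoch loss
values recorded by Keras during training; the sketch below is illustrative.

    import matplotlib.pyplot as plt

    # Reconstruction error (MSE) per epoch on the training and validation sets.
    plt.plot(history.history["loss"], label="training loss")
    plt.plot(history.history["val_loss"], label="validation loss")
    plt.xlabel("Epoch")
    plt.ylabel("Reconstruction error (MSE)")
    plt.legend()
    plt.show()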
Once trained, the encoder was used to extract the latent features, which represent the
compressed version of the original data. These features will be used for the clustering phase in
the next step.
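Extracting the latent features then reduces to a forward pass through the trained encoder; the
variable names follow the earlier sketches.

    # Compress the full transformed dataset into the 8-dimensional latent space.
    latent_features = encoder.predict(X_reduced)
    print(latent_features.shape)   # (n_samples, 8), the input to the clustering phase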
Phase 2 has focused on preparing the dataset for deep learning by cleaning the data, handling
missing values and outliers, and transforming the features. Scaling and encoding were applied to
ensure that the data was ready for the autoencoder model. Dimensionality reduction techniques
like PCA helped improve model efficiency by reducing noise and redundancy. With the model
designed and trained, the latent features are now ready for clustering, which will be the next step
in the segmentation process. This phase has established a strong foundation for the clustering and
segmentation in subsequent phases of the project.