
IBM AIML INTERNSHIP

VTU - Rooman Internship 2024-25


Project Team Details

College Name: B. L. D. E. A's V. P. Dr. P. G. Halakatti College of Engineering and Technology

Batch ID: 2753892 (AI-Machine Learning)

Project Title: Real-Time Social Media Analytics Pipeline: Building a Robust Data
Processing Framework

Group Name: Group A22

Group Members:

Name: Bhagyashree S Patil

 CAN ID Number: CAN_33806693

Name: Bhagyashree S Bhairagond

 CAN ID Number: CAN_33846014

Name: Deepa B Patil

 CAN ID Number: CAN_33834957

Name: Gourishankar Mudhol

 CAN ID Number: CAN_33844810


Real-Time Social Media Analytics Pipeline: Building a Robust Data Processing Framework

Phase 2: Data Preprocessing and Model Design

2.1 Overview of Data Preprocessing

The primary objective of this project's second phase is to complete the essential data
preprocessing tasks required to prepare the real-time social media analytics dataset for deep
learning. The first step in this process is data cleaning, which involves identifying and
resolving common problems such as missing values, outliers, and inconsistencies. Statistical
techniques and visualizations are employed to identify missing values, with appropriate
imputation or removal strategies used based on the nature of the data. Using boxplots and
statistical methods like the Z-score, outliers that could distort model outcomes are detected
and either capped or removed. Additionally, any discrepancies in the dataset, such as
duplicate entries or contradictory information, are eliminated to ensure data integrity.

After data cleaning, feature scaling and normalization are applied. These steps ensure that all
features are on a similar scale, which is crucial for deep learning models, as they are sensitive
to the scale of input features. Methods like min-max scaling and Z-score normalization are
used to standardize features and prevent those with larger magnitudes from dominating the
learning process. This ensures that the data is in the right format for efficient processing by
deep learning models, such as autoencoders, in the context of real-time social media
analytics.

To further enhance model efficiency and avoid overfitting, dimensionality reduction techniques are employed. Methods like Principal Component Analysis (PCA) are used to
reduce the number of features while maintaining as much of the variance in the data as
possible. This helps prevent high-dimensional data from leading to overfitting and excessive
training times. Additionally, feature selection techniques like variance thresholding are
applied to remove repetitive or unnecessary features. By focusing on the most significant
patterns in the data, these preprocessing steps help to expedite the training process and
improve the model's ability to generalize, thus enhancing the overall efficiency of the real-
time social media analytics pipeline.


2.2 Data Cleaning: Handling Missing Values, Outliers, and Inconsistencies

To ensure that the data is reliable and ready for modeling, cleaning was a crucial step in this
phase. The following techniques were employed:

Missing Values:

Missing values can lead to biased results or model failures, impairing the model’s overall
performance. To handle missing values, the following methods were applied:

 Descriptive Statistics: We analyzed the data using measures like the mean and
median to identify trends and patterns in missing data.
 Heatmaps: Visualizing missing data through heatmaps allowed us to pinpoint areas
where imputation was required.

Strategies to Handle Missing Values:

 Imputation: For numerical features, we used the mean or median for imputation,
and for categorical features, we used the mode to fill in missing values.
 Removal: To maintain dataset integrity, we removed rows that had an excessive
number of missing values, ensuring that the imputation process didn’t distort the
analysis.
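
As a brief illustration, the following is a minimal pandas sketch of how these strategies can be applied. The DataFrame name df, the 70% row-completeness threshold, and the choice of median/mode imputation are illustrative assumptions rather than the project's exact settings.

python
import pandas as pd
import seaborn as sns

# df is assumed to be the raw social media dataset loaded as a pandas DataFrame
print(df.isnull().sum())              # descriptive view of missing values per column
sns.heatmap(df.isnull(), cbar=False)  # heatmap highlighting where values are missing

# Remove rows with an excessive number of missing values first
# (here: keep rows that have at least 70% of their columns populated)
df = df.dropna(thresh=int(0.7 * df.shape[1]))

# Impute numerical features with the median and categorical features with the mode
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

cat_cols = df.select_dtypes(include="object").columns
for col in cat_cols:
    df[col] = df[col].fillna(df[col].mode()[0])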

Outliers:

Outliers can significantly impact models, especially distance-based algorithms like K-means
or autoencoders. The following methods were used to identify and manage outliers:

 Boxplots: Boxplots were used to visually identify outliers in the dataset for each
feature.
 Z-Score Method: This statistical technique was employed to detect data points that
deviate significantly from the mean.

Strategies to Handle Outliers:

 Capping/Clipping: For extreme outliers, the Z-score threshold was applied to limit
data within a specified range, ensuring that these values do not excessively influence
model training.
 Removal: Outliers that were deemed errors or abnormal data points were completely
removed from the dataset to ensure data consistency and model accuracy.
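
A minimal sketch of the Z-score detection and capping strategy described above; the cutoff of 3 standard deviations is a common but illustrative choice, and df is assumed to be the DataFrame produced by the missing-value handling step.

python
threshold = 3  # illustrative Z-score cutoff

for col in df.select_dtypes(include="number").columns:
    mean, std = df[col].mean(), df[col].std()
    z_scores = (df[col] - mean) / std
    print(col, "outliers detected:", int((z_scores.abs() > threshold).sum()))

    # Cap/clip extreme values to the threshold boundaries instead of discarding them
    df[col] = df[col].clip(mean - threshold * std, mean + threshold * std)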

Inconsistencies:

Inconsistent data, such as duplicate entries or contradictory information, was handled by the following techniques:

 Duplicate Entries: We used pandas' duplicated() function to identify and remove any duplicate rows, ensuring that the data was unique and not redundant.

 Contradictory Information: For example, if there were conflicting values, such as an age that did not match the income level, those entries were flagged for further investigation and corrected or removed from the dataset.
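
The duplicate and consistency checks can be sketched as follows; the age/income rule shown is purely illustrative of the kind of contradiction flagged, not an actual rule from the dataset.

python
# Remove duplicate rows, mirroring the pandas duplicated() check described above
before = len(df)
df = df[~df.duplicated()]
print("Duplicate rows removed:", before - len(df))

# Flag contradictory records for manual review (illustrative rule)
if {"age", "income"}.issubset(df.columns):
    flagged = df[(df["age"] < 16) & (df["income"] > 100000)]
    print("Rows flagged for review:", len(flagged))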

2.3 Feature Scaling and Normalization

Feature scaling ensures that all features are comparable in magnitude, which is essential for
deep learning models like autoencoders. Without scaling, features with larger magnitudes
could dominate the learning process, leading to inaccurate model performance. The two main
techniques used for scaling and normalization in this project are Standardization (Z-score
normalization) and Min-Max Scaling.

Steps for Scaling and Normalization:

 Standardization (Z-score normalization):
o This method scales the data so that each feature has a mean of 0 and a standard deviation of 1. It ensures that all features are centered around 0 with uniform variance, making them comparable.
o Formula:

X′ = (X − μ) / σ

where μ and σ are the mean and standard deviation of the feature, respectively.

 Min-Max Scaling:
o This method rescales the data to a fixed range, usually [0, 1], ensuring that all feature values lie within this range.
o Formula:

X′ = (X − X_min) / (X_max − X_min)

where X_min and X_max are the minimum and maximum values of the feature, respectively.

Code Example for Scaling:

python
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# 'data' is assumed to be the cleaned numeric feature matrix from Section 2.2

# Standardization (zero mean, unit variance)
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)

# Min-Max Scaling to the [0, 1] range
min_max_scaler = MinMaxScaler()
data_minmax_scaled = min_max_scaler.fit_transform(data)

Screenshots of Scaling Process:

2.4 Feature Transformation and Dimensionality Reduction

To prepare the dataset for deep learning, feature transformation and dimensionality reduction
were critical steps to optimize performance and reduce complexity.

1. Dimensionality Reduction:

Given the high number of features in the dataset, dimensionality reduction techniques were
applied to:

 Minimize computational cost.
 Avoid overfitting by reducing redundancy and noise.
 Highlight important patterns in the data.

Techniques Used:

 Principal Component Analysis (PCA):
o PCA was implemented to reduce dimensionality while retaining maximum variance. This technique helped:
 Remove multicollinearity among highly correlated features.
 Provide a compact and meaningful representation of the dataset.
 Feature Selection:


o Features that contributed minimal value to the clustering process were identified and removed. Methods applied included:
 Analyzing feature importance scores.
 Variance Thresholding to exclude low-variance features across samples.
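
A minimal scikit-learn sketch of these two techniques; data_scaled is assumed to be the standardized matrix from Section 2.3, and the variance cutoff and 95% retained-variance target are illustrative values.

python
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold

# Drop near-constant features first (cutoff is illustrative)
selector = VarianceThreshold(threshold=0.01)
data_selected = selector.fit_transform(data_scaled)

# Keep enough principal components to explain roughly 95% of the variance
pca = PCA(n_components=0.95)
data_reduced = pca.fit_transform(data_selected)

print("Features kept after variance thresholding:", data_selected.shape[1])
print("Principal components retained:", pca.n_components_)
print("Total explained variance:", pca.explained_variance_ratio_.sum())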

2. Importance of Dimensionality Reduction in the Project:

 Enhanced model training speed and efficiency.
 Improved the generalization ability of the model by reducing overfitting risks.
 Provided a cleaner and more interpretable dataset for subsequent analysis.

2.5 Autoencoder Model Design

The focus was on designing and implementing an autoencoder model to perform deep
clustering by compressing data into a latent feature space.

1. Model Architecture:

 Encoder Architecture:
o Input Layer: Accepts the transformed feature vector.
o Dense Layers: Progressively reduce the feature space size (e.g., 64 → 32 → 8 neurons).
o Final Latent Layer: Compresses the data into an 8-dimensional latent space.
 Decoder Architecture:
o Dense Layers: Mirror the encoder by expanding the latent representation back to the original feature space.
o Output Layer: Uses a sigmoid activation function to normalize reconstructed values between 0 and 1.

2. Loss Function and Optimizer:

 Loss Function: Mean Squared Error (MSE) to minimize reconstruction errors.
 Optimizer: Adam optimizer for efficient gradient-based optimization.
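
The architecture and training objective described above can be summarized in a short Keras sketch; the ReLU activations in the hidden layers and the variable data_reduced (the matrix produced in Section 2.4, which sets the input dimension) are assumptions for illustration.

python
from tensorflow.keras import layers, models

input_dim = data_reduced.shape[1]  # number of features after dimensionality reduction

# Encoder: progressively compress the features into an 8-dimensional latent space
inputs = layers.Input(shape=(input_dim,))
x = layers.Dense(64, activation="relu")(inputs)
x = layers.Dense(32, activation="relu")(x)
latent = layers.Dense(8, activation="relu", name="latent")(x)

# Decoder: mirror the encoder back to the original feature space
x = layers.Dense(32, activation="relu")(latent)
x = layers.Dense(64, activation="relu")(x)
outputs = layers.Dense(input_dim, activation="sigmoid")(x)

autoencoder = models.Model(inputs, outputs)
encoder = models.Model(inputs, latent)

# MSE reconstruction loss with the Adam optimizer, as described above
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.summary()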

2.6 Model Training and Validation

With the autoencoder designed, the training process involved the following steps:

1. Data Split:

 The dataset was split into training and validation sets to evaluate model performance
and prevent overfitting.

2. Training Configuration:

 Epochs: 50.
 Batch Size: 32.

3. Monitoring Performance:


 Reconstruction error was continuously monitored on both the training and validation datasets.
 Hyperparameters such as the learning rate and batch size were fine-tuned to optimize results.

4. Outcome:

 The encoder successfully extracted an 8-dimensional latent feature representation.
 These latent features encapsulate the essential patterns and structures in the original data, ready for clustering in the next phase.
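
Putting the training configuration together, a minimal sketch of the fit-and-extract step; the 80/20 train/validation split and the random seed are assumptions, while the epoch count and batch size follow the settings above.

python
from sklearn.model_selection import train_test_split

# Hold out a validation set to monitor reconstruction error and overfitting
X_train, X_val = train_test_split(data_reduced, test_size=0.2, random_state=42)

history = autoencoder.fit(
    X_train, X_train,               # the autoencoder reconstructs its own input
    validation_data=(X_val, X_val),
    epochs=50,
    batch_size=32,
)

# Extract the 8-dimensional latent features for clustering in the next phase
latent_features = encoder.predict(data_reduced)
print(latent_features.shape)  # (n_samples, 8)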

2.7 Conclusion of Phase 2

Phase 2 focused on transforming and preparing the dataset for deep learning. Key
achievements included:

 Application of PCA and feature selection to reduce dimensionality, thereby optimizing efficiency and preventing overfitting.
 Development and training of an autoencoder model to extract compressed latent features for clustering.

The groundwork established in this phase ensures a robust platform for segmentation and
clustering, which will drive actionable insights in subsequent phases of the project.
