Phase 2: AIML
Project Title: Real-Time Social Media Analytics Pipeline: Building a Robust Data
Processing Framework
Group Members:
The primary objective of this project's second phase is to complete the essential data
preprocessing tasks required to prepare the real-time social media analytics dataset for deep
learning. The first step in this process is data cleaning, which involves identifying and
resolving common problems such as missing values, outliers, and inconsistencies. Statistical
techniques and visualizations are employed to identify missing values, with appropriate
imputation or removal strategies used based on the nature of the data. Using boxplots and
statistical methods like the Z-score, outliers that could distort model outcomes are detected
and either capped or removed. Additionally, any discrepancies in the dataset, such as
duplicate entries or contradictory information, are eliminated to ensure data integrity.
After data cleaning, feature scaling and normalization are applied. These steps ensure that all
features are on a similar scale, which is crucial for deep learning models, as they are sensitive
to the scale of input features. Methods like min-max scaling and Z-score normalization are
used to standardize features and prevent those with larger magnitudes from dominating the
learning process. This ensures that the data is in the right format for efficient processing by
deep learning models, such as autoencoders, in the context of real-time social media
analytics.
Data Cleaning:
To ensure that the data is reliable and ready for modeling, cleaning was a crucial step in this phase. The following techniques were employed:
Missing Values:
Missing values can lead to biased results or model failures, impairing the model’s overall
performance. To handle missing values, the following methods were applied:
Descriptive Statistics: We analyzed the data using measures like the mean and
median to identify trends and patterns in missing data.
Heatmaps: Visualizing missing data through heatmaps allowed us to pinpoint areas
where imputation was required.
Imputation: For numerical features, we used the mean or median for imputation,
and for categorical features, we used the mode to fill in missing values.
Removal: To maintain dataset integrity, we removed rows that had an excessive
number of missing values, ensuring that the imputation process didn’t distort the
analysis.
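The following is a minimal sketch of these steps, assuming the dataset has been loaded into a pandas DataFrame named df; the 50% row-completeness threshold and the choice of median over mean are illustrative assumptions:
python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Summarize missingness and basic descriptive statistics per column
print(df.isnull().sum())
print(df.describe())

# Heatmap of missing values to see where imputation is needed
sns.heatmap(df.isnull(), cbar=False)
plt.show()

# Remove rows with an excessive number of missing values
df = df.dropna(thresh=int(0.5 * df.shape[1]))

# Impute: median for numerical columns, mode for categorical columns
for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].median())
    else:
        df[col] = df[col].fillna(df[col].mode().iloc[0])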
Outliers:
Outliers can significantly impact models, particularly distance-based algorithms such as K-means and reconstruction-based models such as autoencoders. The following methods were used to identify and manage outliers:
Boxplots: Boxplots were used to visually identify outliers in the dataset for each
feature.
Z-Score Method: This statistical technique was employed to detect data points that
deviate significantly from the mean.
Capping/Clipping: Values beyond the Z-score threshold were clipped to a specified range so that extreme outliers do not excessively influence model training.
Removal: Outliers that were deemed errors or abnormal data points were completely
removed from the dataset to ensure data consistency and model accuracy.
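The following is a sketch of the Z-score detection and capping described above, assuming df holds the numeric features; the cutoff of 3 standard deviations is an assumption:
python
import numpy as np

numeric_cols = df.select_dtypes(include=[np.number]).columns

# Z-score: how many standard deviations each value lies from the column mean
z_scores = (df[numeric_cols] - df[numeric_cols].mean()) / df[numeric_cols].std()

# Capping/clipping: limit values to the mean +/- 3 standard deviations
lower = df[numeric_cols].mean() - 3 * df[numeric_cols].std()
upper = df[numeric_cols].mean() + 3 * df[numeric_cols].std()
df[numeric_cols] = df[numeric_cols].clip(lower=lower, upper=upper, axis=1)

# Alternatively, remove rows containing any extreme outlier
# df = df[(z_scores.abs() <= 3).all(axis=1)]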
Inconsistencies:
Duplicate entries and contradictory records were identified and removed so that each observation appears only once and the dataset remains internally consistent.
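A minimal sketch of this step, assuming df is the working DataFrame:
python
# Drop exact duplicate rows, keeping the first occurrence
df = df.drop_duplicates(keep="first").reset_index(drop=True)

# Normalize text columns so that case and whitespace variants
# of the same value are not treated as distinct entries
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].str.strip().str.lower()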
Feature Scaling and Normalization:
Feature scaling ensures that all features are comparable in magnitude, which is essential for deep learning models like autoencoders. Without scaling, features with larger magnitudes could dominate the learning process, leading to inaccurate model performance. The two main techniques used for scaling and normalization in this project are Standardization (Z-score normalization) and Min-Max Scaling.
Standardization (Z-score normalization):
o This method rescales each feature to have zero mean and unit variance.
o Formula: X′ = (X − μ) / σ, where μ and σ are the mean and standard deviation of the feature, respectively.
Min-Max Scaling:
o This method rescales the data to a fixed range, usually [0, 1], ensuring that all feature values lie within this range.
o Formula: X′ = (X − min(X)) / (max(X) − min(X)), where min(X) and max(X) are the minimum and maximum values of the feature, respectively.
python
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# 'data' is the cleaned numeric feature matrix (NumPy array or DataFrame)

# Standardization (zero mean, unit variance)
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)

# Min-Max Scaling (rescale each feature to [0, 1])
min_max_scaler = MinMaxScaler()
data_minmax_scaled = min_max_scaler.fit_transform(data)
To prepare the dataset for deep learning, feature transformation and dimensionality reduction
were critical steps to optimize performance and reduce complexity.
1. Dimensionality Reduction:
Given the high number of features in the dataset, dimensionality reduction was applied to reduce complexity and retain only the most informative structure in the data.
Techniques Used:
The encoder of the autoencoder described in the next section serves as the dimensionality reduction step, compressing the transformed features into a compact latent representation.
The focus was on designing and implementing an autoencoder model to perform deep
clustering by compressing data into a latent feature space.
1. Model Architecture:
Encoder Architecture:
o Input Layer: Accepts the transformed feature vector.
o Dense Layers: Progressively reduces the feature space size (e.g., 64 → 32 →
8 neurons).
o Final Latent Layer: Compresses the data into an 8-dimensional latent space.
Decoder Architecture:
o Dense Layers: Mirrors the encoder by expanding the latent representation
back to the original feature space.
o Output Layer: Utilizes a sigmoid activation function to normalize
reconstructed values between 0 and 1.
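The following is a minimal Keras sketch of this architecture. Only the layer sizes (64 → 32 → 8) and the sigmoid output are taken from the description above; the ReLU activations, Adam optimizer, and mean-squared-error reconstruction loss are assumptions:
python
from tensorflow.keras import layers, Model

input_dim = data_minmax_scaled.shape[1]  # number of features after scaling

# Encoder: progressively compress the feature space (64 -> 32 -> 8)
inputs = layers.Input(shape=(input_dim,))
x = layers.Dense(64, activation="relu")(inputs)
x = layers.Dense(32, activation="relu")(x)
latent = layers.Dense(8, activation="relu", name="latent")(x)

# Decoder: mirror the encoder back to the original feature space
x = layers.Dense(32, activation="relu")(latent)
x = layers.Dense(64, activation="relu")(x)
outputs = layers.Dense(input_dim, activation="sigmoid")(x)  # reconstructions in [0, 1]

autoencoder = Model(inputs, outputs)
encoder = Model(inputs, latent)  # used to extract the 8-dimensional latent features
autoencoder.compile(optimizer="adam", loss="mse")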
With the autoencoder designed, the training process involved the following steps:
1. Data Split:
The dataset was split into training and validation sets to evaluate model performance
and prevent overfitting.
2. Training Configuration (a code sketch follows this list):
Epochs: 50.
Batch Size: 32.
3. Monitoring Performance:
Training and validation loss were tracked at each epoch to confirm convergence and to watch for overfitting.
4. Outcome:
The trained encoder produces an 8-dimensional latent representation of the data, which forms the basis for the segmentation and clustering work in the next phase.
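A minimal training sketch under the configuration above; the 80/20 split ratio and the use of scikit-learn's train_test_split are assumptions:
python
from sklearn.model_selection import train_test_split

# Hold out a validation set to evaluate reconstruction quality and detect overfitting
X_train, X_val = train_test_split(data_minmax_scaled, test_size=0.2, random_state=42)

# The autoencoder learns to reconstruct its own input
history = autoencoder.fit(
    X_train, X_train,
    validation_data=(X_val, X_val),
    epochs=50,
    batch_size=32,
)

# Extract the latent features that feed the clustering stage
latent_features = encoder.predict(data_minmax_scaled)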
Phase 2 focused on transforming and preparing the dataset for deep learning. Key achievements included:
Cleaning the dataset by handling missing values, outliers, and inconsistencies.
Scaling and normalizing the features using standardization and min-max scaling.
Designing an autoencoder that compresses the transformed features into an 8-dimensional latent space.
Training the autoencoder on a train/validation split for 50 epochs with a batch size of 32.
The groundwork established in this phase ensures a robust platform for segmentation and
clustering, which will drive actionable insights in subsequent phases of the project.