DS203 Exercise 5
Exercise - 05
Aryan Kumar 22B2423
Introduction
In this exercise, we embark on a comprehensive journey through data processing and analysis
across various domains. Our task involves delving into three distinct datasets, each presenting
its own unique challenges and opportunities for insight.
Firstly, we encounter the e5-htr-current.csv dataset, tracking the current flow through a
transformer. Our objective here is to conduct Exploratory Data Analysis (EDA), identifying
outliers, missing data, and irregularities in the dataset's behavior over a specified period.
Through meticulous analysis and visualization, we aim to transform potentially unreliable data
into a more robust and informative form.
The second dataset, e5-Run2-June22-subset-100-cols.csv, is a 100-column subset of process parameters recorded at a chemical plant, where the focus is on cleaning, standardization, correlation analysis, and PCA.
The final dataset, mnist_test.csv, offers a unique challenge in the realm of image classification.
Here, we explore the application of both PCA and t-SNE (t-distributed Stochastic Neighbor
Embedding) techniques for dimensionality reduction and visualization. Through these methods,
we seek to unravel the underlying structure and patterns within the dataset, enabling more
intuitive comprehension and analysis.
Throughout our journey, we meticulously document our methodologies, analyses, and insights,
culminating in a comprehensive understanding of the datasets and the invaluable lessons
gleaned from this enriching exercise in data exploration and analysis.
Problem 1: Dealing with outliers / missing / incorrect data
A. Perform EDA on the data file e5-htr-current.csv to understand the overall nature / quality of
the data. Create a description of the data and document your EDA observations in detail.
In our exploratory data analysis (EDA) of the "e5-htr-current.csv" file, we aimed to gain a
comprehensive understanding of the dataset's nature and quality. Here's a summary of our
approach and observations:
There are 1324 missing values in the dataset, while there are no occurrences of '#REF!' values.
Missing values can significantly affect the quality and reliability of analyses, so it's essential to
handle them appropriately.
Visual Exploration:
The presence of numerous missing values in the original dataset results in gaps, visually
represented as white spaces, in the bar graph.
Axes: The X-axis represents the “Timestamp”, and the Y-axis represents the “HT R Phase Current”.
The Y-axis ranges from 0 to 100, indicating the current’s magnitude at different timestamps.
Data Representation: The data is represented by blue vertical lines. Each line corresponds to the
HT R Phase current at a specific timestamp.
Variations: There are noticeable fluctuations in the current over time, with some peaks reaching close to 100 on the Y-axis, indicating instances of high current.
1. Peak near 0: There’s a significant peak at 0 on the X-axis, indicating that a current level of
0 has the highest frequency. This means that the most common current level is 0.
2. Decline in Frequency: After the peak at 0, there’s a rapid decline in frequency as the
current increases, with minor peaks around 80. This indicates that higher current levels
occur less frequently.
In short, the HT R Phase Current is very low most of the time, but surges occur regularly; away from zero the distribution is relatively flat, with a roughly normal bump near 73 A.
Overall, our EDA process facilitated a comprehensive understanding of the dataset's
characteristics, allowing us to identify key patterns, anomalies, and areas requiring further
attention. This foundational analysis serves as the basis for more advanced modeling and
analysis techniques, guiding subsequent steps in the data analysis pipeline.
B. Identify a 2-week period, where the data seems to be relatively unstable. For example, see the
image below: there are wild fluctuations and missing observations.
● Imputation of missing values can be done using interpolation or mean imputation; in this case, the value from the same time on the previous day is used (see the sketch below).
This is the result after imputing the missing values in the original dataset (e5-htr-current.csv) and saving the new file as e5-htr-current-filled.csv:
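A minimal sketch of this previous-day imputation, assuming the columns are named 'Timestamp' and 'HT R Phase Current' and that the timestamps parse cleanly (both assumptions about the file layout):

```python
import pandas as pd

# Load the raw data and parse timestamps (column names are assumed).
df = pd.read_csv("e5-htr-current.csv", parse_dates=["Timestamp"])
df = df.set_index("Timestamp").sort_index()

col = "HT R Phase Current"
df[col] = pd.to_numeric(df[col], errors="coerce")  # non-numeric entries become NaN

# Fill each gap with the value at the same time on the previous day.
previous_day = df[col].shift(freq="1D")
df[col] = df[col].fillna(previous_day)

df.to_csv("e5-htr-current-filled.csv")
```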
Second Code Snippet:
● Similar to the first snippet, this code also loads the same CSV file but does not preprocess
it for missing values.
● It performs the same operations as the first snippet, calculating the rolling standard
deviation and identifying the unstable period.
● However, before plotting, this code snippet visualizes the time series data after imputing
missing values with the previous day's values.
● It then uses Plotly Express to create a time series plot of the 'HT R Phase Current' data,
highlighting the same unstable period identified earlier.
● Finally, it displays the plot.
Result:
● Both code snippets identify the same 2-week unstable period, which starts on June 17, 2019, at
12:45:00, and ends on July 1, 2019, at 12:45:00.
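A minimal sketch of the rolling-standard-deviation approach described above (the 14-day window and the use of the filled file are assumptions; Plotly Express is used for the plot as in the snippets above):

```python
import pandas as pd
import plotly.express as px

# Assumes the filled file and column names from the previous steps.
df = pd.read_csv("e5-htr-current-filled.csv", parse_dates=["Timestamp"])
df = df.set_index("Timestamp").sort_index()
col = "HT R Phase Current"

# 14-day rolling standard deviation; its maximum marks the end of the
# most unstable 2-week window.
roll_std = df[col].rolling("14D").std()
end = roll_std.idxmax()
start = end - pd.Timedelta(days=14)
print(f"Unstable period: {start} to {end}")

# Time-series plot with the unstable window shaded.
fig = px.line(df.reset_index(), x="Timestamp", y=col)
fig.add_vrect(x0=start, x1=end, fillcolor="red", opacity=0.2, line_width=0)
fig.show()
```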
C. In your report, clearly state the start and end dates you have chosen for such detailed
analysis.
D. On this data segment, implement at least 3 methods to remove outliers (if any) / smoothen
the data / impute missing data, thereby converting the relatively bad data into good data.
Here are three methods each for removing outliers, smoothening the data, and imputing missing
data:
Removing Outliers:
Interquartile Range (IQR) Method:
● Calculate the interquartile range (IQR) of the data.
● Identify outliers as data points that fall below Q1 - 1.5 * IQR or above Q3 + 1.5 *
IQR.
● Remove outliers from the dataset.
Z-Score Method:
● Calculate the z-score for each data point, representing the number of standard
deviations away from the mean.
● Set a threshold for z-scores (e.g., |z-score| > 3) to identify outliers.
● Remove data points with z-scores exceeding the threshold.
Modified Z-Score Method:
● Calculate the modified z-score, which is a robust version of the z-score that is less
sensitive to outliers.
● Set a threshold for modified z-scores (e.g., |modified z-score| > 3.5) to identify
outliers.
● Remove data points with modified z-scores exceeding the threshold.
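A minimal sketch of the three outlier-removal rules listed above, written as reusable helpers (`s` stands for the 'HT R Phase Current' series restricted to the chosen 2-week segment):

```python
import pandas as pd

def remove_outliers_iqr(s: pd.Series) -> pd.Series:
    """Keep values within [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s[(s >= q1 - 1.5 * iqr) & (s <= q3 + 1.5 * iqr)]

def remove_outliers_zscore(s: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Keep values whose z-score magnitude is at most the threshold."""
    z = (s - s.mean()) / s.std()
    return s[z.abs() <= threshold]

def remove_outliers_modified_zscore(s: pd.Series, threshold: float = 3.5) -> pd.Series:
    """Robust variant based on the median absolute deviation (MAD)."""
    median = s.median()
    mad = (s - median).abs().median()
    modified_z = 0.6745 * (s - median) / mad
    return s[modified_z.abs() <= threshold]
```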
Smoothening the Data:
LOESS Smoothing:
● Apply local regression (LOESS - LOcally WEighted Scatterplot Smoothing) to fit a curve to the data.
● Use weighted regression to estimate the value of each data point based on
neighboring points.
● Adjust the span parameter to control the degree of smoothing.
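A minimal LOESS sketch using statsmodels' lowess; the `frac` argument plays the role of the span parameter, and its value here is an assumption:

```python
import numpy as np
import pandas as pd
from statsmodels.nonparametric.smoothers_lowess import lowess

def loess_smooth(s: pd.Series, frac: float = 0.05) -> pd.Series:
    """Fit a locally weighted regression curve to a time-indexed series."""
    x = np.arange(len(s))  # use the sample position as the regressor
    smoothed = lowess(s.values, x, frac=frac, return_sorted=False)
    return pd.Series(smoothed, index=s.index)
```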
E. Can you think of a method that uses other good data – elsewhere in the data set – to
guide you into treating the bad region that you have identified? That is, can you use the
global data trend information to make local changes?
One method that can utilize information from other good data elsewhere in the dataset to guide
treatment of the identified bad region is piecewise regression.
Piecewise regression involves fitting multiple linear regression models to different segments of
the data, allowing for different slopes and intercepts in each segment. By identifying stable
segments of the data outside the bad region and fitting regression models to those segments,
you can use the trend information from the good data to estimate the expected values within the
bad region.
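A minimal sketch of this idea, using a low-order polynomial trend fitted on the good data outside the unstable window as a simplified stand-in for full piecewise regression (the window boundaries come from Part B; the polynomial degree is an assumption):

```python
import numpy as np
import pandas as pd

def repair_with_global_trend(s: pd.Series, start, end, degree: int = 3) -> pd.Series:
    """Re-estimate the bad segment [start, end] from a trend fitted on the good data.

    Assumes `s` has a DatetimeIndex.
    """
    s = s.copy()
    t = (s.index - s.index[0]).total_seconds().to_numpy()  # time as seconds since start
    bad = (s.index >= start) & (s.index <= end)
    good = ~bad & s.notna().to_numpy()

    coeffs = np.polyfit(t[good], s.to_numpy()[good], deg=degree)
    s[bad] = np.polyval(coeffs, t[bad])
    return s

# Example usage on the unstable window identified in Part B:
# repaired = repair_with_global_trend(df["HT R Phase Current"],
#                                     pd.Timestamp("2019-06-17 12:45:00"),
#                                     pd.Timestamp("2019-07-01 12:45:00"))
```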
F. For each of the methods implemented, describe the steps, and clearly show your final
results (visually + using statistical measures). Adequately justify your decisions!
Exploratory Data Analysis (EDA) is a crucial step in understanding the dataset's characteristics and identifying patterns or anomalies. The EDA process for the given dataset starts as follows:
Load the Data: Begin by loading the dataset into your environment using pandas'
read_csv function.
Initial Exploration: Take a quick look at the first few rows of the dataset using the head()
function to understand its structure and contents.
Summary Statistics: Compute summary statistics such as mean, median, standard
deviation, minimum, and maximum values for numerical columns. This gives an overview
of the central tendency and spread of the data.
Data Distribution: Visualize the distribution of numerical variables using histograms. This
helps in understanding the data's spread and identifying potential outliers.
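A minimal sketch of these first EDA steps, shown on the e5-Run2-June22-subset-100-cols.csv file that the following paragraphs describe:

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("e5-Run2-June22-subset-100-cols.csv")

print(df.head())        # structure and contents of the first few rows
print(df.describe())    # mean, std, min, max, quartiles for numeric columns
print(df.isna().sum())  # missing values per column

# Histograms of the numeric columns (the date column 'c1' is excluded;
# non-numeric entries are coerced to NaN so they do not block plotting).
numeric = df.drop(columns=["c1"]).apply(pd.to_numeric, errors="coerce")
numeric.hist(figsize=(20, 20), bins=50)
plt.tight_layout()
plt.show()
```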
Process Parameters: These are variables that directly or indirectly influence the chemical
reactions or operations within the plant. Examples could include temperature, pressure,
flow rates, concentrations of reactants or products, and various other physical or
chemical properties.
Time Series Data: The data is organized as a time series, with measurements taken at regular intervals each day.
Subset of Columns: This subset contains 100 columns, suggesting a relatively rich dataset with a diverse range of process parameters.
Purpose of Analysis: The dataset could be intended for various purposes, such as
monitoring plant performance, optimizing processes, detecting anomalies or deviations
from expected behavior, or conducting research and analysis for improvements or
troubleshooting.
Data Quality and Integrity: Given the critical nature of process parameters in a chemical
plant, ensuring data quality and integrity is paramount. This includes factors such as
accuracy, precision, consistency, completeness, and reliability of the measurements.
Visualization
Some initial exploratory data analysis (EDA) on the dataset "e5-Run2-June22-subset-100-cols.csv":
Column 'c1': The column 'c1' was dropped because it contains dates, which are not relevant to the analysis of the distribution of the other variables.
Handling Missing Values: Both empty cells and '#REF!' values were replaced with NaN to ensure consistency in handling missing data across the dataset.
Data Distribution: Histograms were plotted for each column in the dataset to visualize the distribution of values. Based on visual inspection:
● Columns 'c2', 'c11', 'c14', 'c36', 'c37', and 'c55' appear to be more or less constant,
indicating that they may not provide much variability or useful information for
further analysis.
● Other columns seem to follow a Gaussian or normal distribution, suggesting that
they might be suitable for various statistical analyses and modeling techniques.
Missing Columns in the Plot: Columns 'c82', 'c88', 'c96', and 'c97' do not appear in the plot, possibly due to missing values or errors in those specific columns.
After imputing with a moving average, the new plot contains all 99 columns.
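A minimal sketch of this cleaning and imputation step, assuming a centered 5-sample moving average (the window size is an assumption):

```python
import numpy as np
import pandas as pd

df = pd.read_csv("e5-Run2-June22-subset-100-cols.csv")
df = df.drop(columns=["c1"])                  # drop the date column
df = df.replace("#REF!", np.nan)              # treat '#REF!' as missing
df = df.apply(pd.to_numeric, errors="coerce")

# Impute missing values with a centered moving average, then fall back to
# the column mean for anything the rolling window cannot cover.
rolling_mean = df.rolling(window=5, center=True, min_periods=1).mean()
df = df.fillna(rolling_mean).fillna(df.mean())
```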
B. Process each of the columns to resolve each of the following matters, and implement the
solutions:
a. Do you want to keep or discard the column? What is your basis for these decisions?
Quality of data: Assess the quality of data in each column. Columns with a high
proportion of missing values or errors may not provide meaningful insights and could be
candidates for removal.
Variability: Check the variability of data in each column. If a column has low variability,
meaning most of the values are the same or very similar, it may not provide much
information and could be considered for removal.
Domain knowledge: Consider any domain-specific knowledge that could inform the
decision to keep or discard a column. Some columns may be critical for domain-specific
reasons even if they appear less relevant from a statistical perspective.
b. Are there any outliers / missing values / wrong values in the data? If so, how will you fix
them? Fix them!
Yes, there are many outliers, which can easily be seen in the box plots. After running the code to find missing values, the plot below conveys the same.
To identify and fix outliers, missing values, or wrong values in the data, we can follow
these steps:
Identify Outliers:
● Outliers can be detected using statistical methods such as Z-score, IQR
(Interquartile Range), or visual inspection through box plots.
● We can calculate the Z-score for each data point and identify those that fall
beyond a certain threshold (e.g., Z-score greater than 3 or less than -3).
Handle Outliers:
● Outliers can be treated in several ways, including:
● Removing them if they are data entry errors.
● Transforming them using techniques like winsorization or log
transformation.
● Imputing them with a more suitable value (e.g., mean, median, or a
value based on domain knowledge).
Identify Missing Values:
● Missing values can be identified using functions like isnull() or isna() in
pandas.
● Visual inspection of the data or summary statistics like counts can also
reveal missing values.
Handle Missing Values:
● Missing values can be handled by:
● Removing rows or columns with a large proportion of missing values
if they don't contribute significantly to the analysis.
● Imputing missing values using methods like mean, median, mode, or moving average; here, a moving average was used.
● For time-series data, missing values can sometimes be interpolated
using methods like linear interpolation or time-series forecasting
techniques.
Identify Wrong Values:
● Wrong values can be identified through domain knowledge or by comparing
values against known constraints or valid ranges.
Handle Wrong Values:
● Wrong values can be corrected based on domain knowledge or by replacing
them with valid values.
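As an illustration of one of the options above, a minimal winsorization sketch that clips every column to assumed quantile limits (1st and 99th percentiles):

```python
import pandas as pd

def winsorize_columns(df: pd.DataFrame, lower: float = 0.01, upper: float = 0.99) -> pd.DataFrame:
    """Clip each numeric column to the given quantile limits."""
    low = df.quantile(lower)
    high = df.quantile(upper)
    return df.clip(lower=low, upper=high, axis=1)
```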
Yes. Standardization (Z-score normalization) scales each feature so that it has a mean of 0 and a standard deviation of 1. It preserves the shape of the original distribution while centering the data around 0.
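A minimal standardization sketch using scikit-learn's StandardScaler (`df` stands for the cleaned numeric DataFrame from the previous steps):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns, index=df.index)

print(df_scaled.mean().round(3).head())  # approximately 0 for every column
print(df_scaled.std().round(3).head())   # approximately 1 for every column
```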
d. Which of the columns are correlated? Create a correlation heat map and state your
observations.
It seems that c2 and c82 are fully constant.
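A minimal sketch of how such a correlation heat map can be produced (seaborn is assumed to be available; the 0.95 threshold for flagging highly correlated pairs is an assumption):

```python
import matplotlib.pyplot as plt
import seaborn as sns

corr = df_scaled.corr()

plt.figure(figsize=(18, 15))
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.title("Correlation heat map of process parameters")
plt.show()

# Column pairs with very high absolute correlation.
high = corr.abs() > 0.95
pairs = [(a, b) for a in corr.columns for b in corr.columns if a < b and high.loc[a, b]]
print(pairs[:20])
```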
e. You must deal with the correlated columns. Which columns will you remove from the
consideration of further analysis? (as you may not have enough information about the
domain, take appropriate calls in case of conflicts)
To decide which columns to remove from further analysis due to high correlation, we can
follow these guidelines:
Retain one representative from each cluster: Since highly correlated variables tend to
provide redundant information, we can choose to keep only one representative from each
cluster. This can help reduce multicollinearity issues and simplify the analysis.
Consider domain knowledge: If there are columns that are known to be irrelevant or
redundant based on domain knowledge or business context, those can be removed
regardless of their correlation with other variables.
Prioritize columns with less impact: If there are columns that are less important or have
a weaker impact on the target variable compared to others in the same cluster, those can
be candidates for removal.
Index(['c2', 'c7', 'c8', 'c9', 'c10', 'c11', 'c12', 'c13', 'c14', 'c15', 'c16', 'c17', 'c18', 'c19', 'c21', 'c22', 'c23',
'c27', 'c29', 'c30', 'c31', 'c33', 'c34', 'c35', 'c36', 'c37', 'c40', 'c41', 'c42', 'c44', 'c45', 'c46', 'c49', 'c51',
'c52', 'c53', 'c55', 'c58', 'c60', 'c61', 'c62', 'c63', 'c64', 'c65', 'c66', 'c67', 'c68', 'c69', 'c70', 'c71', 'c72',
'c73', 'c74', 'c75', 'c76', 'c77', 'c78', 'c79', 'c80', 'c81', 'c82', 'c83', 'c84', 'c85', 'c86', 'c88', 'c89', 'c90',
'c91', 'c92', 'c93', 'c94', 'c95', 'c96', 'c97', 'c98', 'c99', 'c100'],
dtype='object')
['c3', 'c4', 'c5', 'c6', 'c20', 'c24', 'c25', 'c26', 'c28', 'c32', 'c38', 'c39', 'c43', 'c47', 'c48', 'c50', 'c54', 'c56',
'c57', 'c59', 'c87']
To identify columns with multicollinearity relationships, you can calculate the Variance
Inflation Factor (VIF) for each column. The VIF measures how much the variance of an
estimated regression coefficient increases if your predictors are correlated. High VIF
values indicate multicollinearity.
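A minimal VIF sketch using statsmodels, applied to the retained columns (`df_reduced` is a placeholder for that DataFrame; the VIF cutoff of 10 is an assumption):

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

X = add_constant(df_reduced)  # add an intercept column for the VIF regressions
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
).drop("const")

print(vif.sort_values(ascending=False).head(10))
high_vif_cols = vif[vif > 10].index.tolist()  # candidates for removal
```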
g. Which of these columns will you discard? Decide, by stating the applicable reasons,
and discard those columns.
h. Perform PCA on the data at the two stages i and ii mentioned below, and analyze the
results.
iii. Do not forget to create and understand the elbow diagram in the context of
PCA.
● In the elbow diagram, the point where the explained variance ratio begins to plateau indicates the number of principal components that adequately describe the dataset.
iv. Interpret the diagram and decide how many principal axes can adequately
describe the Dataset.
The diagram analyzed here is a plot of the cumulative explained variance ratio as a function of the number of principal components. This plot is typically used to determine the optimal number of principal components in PCA; here, 99 percent of the variance is explained by 8-10 principal components.
Key Observations:
1. The cumulative explained variance ratio reaches close to 1 very quickly. This
suggests that a small number of principal components can capture most of
the variance in the data.
2. Each principal axis (or component) in PCA captures a certain amount of the
total variance in the data. The first principal axis captures the most variance,
the second principal axis (orthogonal to the first) captures the second most,
and so on.
3. The ‘elbow’ or the point where the increase in explained variance ratio
becomes less significant is often used as a cutoff point for selecting the
number of principal components. In this case, the ‘elbow’ is not clearly
visible due to the rapid increase to near 1. However, it appears that just one
or a few principal axes can adequately describe the dataset as the
cumulative explained variance ratio reaches close to 1 very quickly.
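A minimal sketch of how the cumulative explained variance (elbow) curve above can be produced (`df_scaled` is a placeholder for the standardized process data):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

pca = PCA().fit(df_scaled)                       # full PCA on the standardized data
cumvar = np.cumsum(pca.explained_variance_ratio_)

plt.plot(range(1, len(cumvar) + 1), cumvar, marker="o")
plt.axhline(0.99, color="red", linestyle="--", label="99% variance")
plt.xlabel("Number of principal components")
plt.ylabel("Cumulative explained variance ratio")
plt.legend()
plt.show()

print("Components needed for 99% variance:", int(np.argmax(cumvar >= 0.99)) + 1)
```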
C. For each of the above tasks, describe the steps, and clearly show your final results (visually +
using statistical measures). Adequately justify all your decisions!
Objective: Perform Principal Component Analysis (PCA) on the mnist_test.csv dataset to reduce
dimensionality and visualize the data.
Steps:
Data Preprocessing:
● Load the mnist_test.csv dataset.
● Handle missing values by imputing them with the mean of each column.
● Handle outliers by applying RobustScaler, which is less sensitive to outliers and
can handle NaN values.
PCA Analysis:
● Fit PCA to the preprocessed data to identify principal components.
● Calculate explained variance ratio for each principal component.
● Calculate cumulative explained variance ratio.
Visualization:
● Plot the explained variance ratio to understand the contribution of each principal
component.
● Plot the cumulative explained variance ratio to determine the number of
components required to capture a certain amount of variance.
● Plot a scatter plot of the first two principal components to visualize the data in
reduced dimensions.
Statistics:
● Calculate the mean and standard deviation of the explained variance ratio to
understand the variability captured by each principal component.
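A minimal sketch of this pipeline, assuming the first column of mnist_test.csv holds the digit label and the remaining columns hold pixel values (an assumption about the file layout):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler

df = pd.read_csv("mnist_test.csv")
labels = df.iloc[:, 0]                                      # assumed label column
X = SimpleImputer(strategy="mean").fit_transform(df.iloc[:, 1:])  # mean imputation
X = RobustScaler().fit_transform(X)                         # outlier-robust scaling

pca = PCA().fit(X)
evr = pca.explained_variance_ratio_
print("Mean / std of explained variance ratio:", evr.mean(), evr.std())

# Cumulative explained variance plot.
plt.plot(np.cumsum(evr))
plt.xlabel("Number of components")
plt.ylabel("Cumulative explained variance ratio")
plt.show()

# Scatter of the first two principal components, coloured by digit label.
pcs = pca.transform(X)[:, :2]
plt.scatter(pcs[:, 0], pcs[:, 1], c=labels, cmap="tab10", s=5)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.colorbar(label="digit")
plt.show()
```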
Interpretation: From the plot, we can see that the explained variance ratio sharply decreases
until around 100 components and then flattens out, forming an ‘elbow’ shape. This suggests that
around 100 components can be considered as an optimal number because beyond this point,
adding more components does not significantly increase the explained variance ratio.
C. Create the scatter plot PC2 v/s PC1 and interpret the results
D. Subject the file mnist_test.csv to t-SNE analysis, to map the data to 2 dimensions
E. Visualize the mapped data by creating a scatter plot of these 2 dimensions, and
interpret the results
1. t-SNE Mapping: t-SNE is a technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets. The axes, labeled “t-SNE Dimension 1” and “t-SNE Dimension 2”, represent the two embedding dimensions produced by the t-SNE reduction. Each point in this plot represents a data point in the high-dimensional dataset.
2. Clusters: The clusters of blue dots scattered across the plot represent the t-SNE mapped
data points. Each cluster of points might represent a grouping or classification within the
data. The proximity of points can be interpreted as a measure of similarity.
3. Interpretation: The exact interpretation of this plot depends on the nature of the original
data. However, in general, points that are close together in this plot are similar in some
sense in the original high-dimensional dataset, and points that are far apart are dissimilar.
We apply t-SNE to reduce the dimensionality of the dataset to two dimensions, allowing us to visualize the data in a scatter plot. The steps are as follows:
Preprocessing: We'll handle any missing values and standardize or normalize the features
as necessary.
t-SNE Analysis: Applying t-SNE to reduce the dimensionality of the dataset to two
dimensions.
Visualization and Analysis: Creating a scatter plot of the two-dimensional t-SNE
embeddings and analyzing the results for any emerging patterns or clusters.
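A minimal t-SNE sketch under the same file-layout assumption (label in the first column); the scaling choice and the default perplexity are assumptions:

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("mnist_test.csv")
labels = df.iloc[:, 0]                                   # assumed label column
X = StandardScaler().fit_transform(df.iloc[:, 1:].fillna(0))

# Map the pixel data to two t-SNE dimensions.
emb = TSNE(n_components=2, random_state=42).fit_transform(X)

plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="tab10", s=5)
plt.xlabel("t-SNE Dimension 1")
plt.ylabel("t-SNE Dimension 2")
plt.colorbar(label="digit")
plt.show()
```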
7. Major Learnings
● Justifying decisions using statistical measures, visualization, and logical
reasoning.
Elbow Method:
● Utilizing the elbow method to determine the optimal number of clusters or principal
components in unsupervised learning tasks.
● Understanding the trade-off between explained variance and complexity to select
the appropriate number of components.
Visualization Techniques:
● Leveraging various visualization techniques such as scatter plots, histograms, box
plots, and heatmaps to explore and communicate data insights effectively.
● Choosing the most suitable visualization method based on the data characteristics
and analysis objectives.
Statistical Measures:
● Calculating descriptive statistics such as mean, median, standard deviation, and
quartiles to summarize data distributions.
● Using statistical tests and metrics to evaluate the effectiveness of data
preprocessing techniques and analysis results.
Domain Knowledge Integration:
● Integrating domain knowledge and context into data analysis and decision-making
processes to ensure the relevance and applicability of findings.
● Collaborating with domain experts to interpret results and derive actionable
insights from data analysis.