
DS203 2023 Sem02

Exercise - 05
Aryan Kumar 22B2423

Introduction
In this exercise, we work through data processing and analysis across three distinct datasets, each presenting its own challenges and opportunities for insight.

Firstly, we encounter the e5-htr-current.csv dataset, tracking the current flow through a
transformer. Our objective here is to conduct Exploratory Data Analysis (EDA), identifying
outliers, missing data, and irregularities in the dataset's behavior over a specified period.
Through meticulous analysis and visualization, we aim to transform potentially unreliable data
into a more robust and informative form.

The second dataset, e5-Run2-June22-subset-100-cols.csv, contains process parameters of a chemical plant. Here the work covers the identification and correction of outliers and missing values, standardization or normalization of columns, exploration of correlations, detection of multicollinearity, and application of Principal Component Analysis (PCA) for dimensionality reduction.

The final dataset, mnist_test.csv, offers a unique challenge in the realm of image classification.
Here, we explore the application of both PCA and t-SNE (t-distributed Stochastic Neighbor
Embedding) techniques for dimensionality reduction and visualization. Through these methods,
we seek to unravel the underlying structure and patterns within the dataset, enabling more
intuitive comprehension and analysis.

Throughout our journey, we meticulously document our methodologies, analyses, and insights,
culminating in a comprehensive understanding of the datasets and the invaluable lessons
gleaned from this enriching exercise in data exploration and analysis.

Problem 1: Dealing with outliers / missing / incorrect data

A. Perform EDA on the data file e5-htr-current.csv to understand the overall nature / quality of
the data. Create a description of the data and document your EDA observations in detail.

In our exploratory data analysis (EDA) of the "e5-htr-current.csv" file, we aimed to gain a
comprehensive understanding of the dataset's nature and quality. Here's a summary of our
approach and observations:

​ Data Loading and Inspection:


● We loaded the dataset using pandas and examined its structure using data.info()
to understand the data types and presence of missing values.
● Basic statistical summaries, such as mean, median, and standard deviation, were
obtained using data.describe() to get an initial understanding of the data
distribution.
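A minimal sketch of this loading and inspection step is shown below; the file name and the '#REF!' check follow the report, while the exact column layout is assumed.

import pandas as pd

# Load the dataset and inspect its structure and summary statistics.
data = pd.read_csv("e5-htr-current.csv")
data.info()                                   # dtypes and non-null counts
print(data.describe())                        # mean, std, quartiles, min / max

# Count missing values and any leftover '#REF!' spreadsheet artefacts.
print("Missing values:", data.isna().sum().sum())
print("'#REF!' values:", (data == "#REF!").sum().sum())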

There are 1324 missing values in the dataset, while there are no occurrences of "REF!" values.
Missing values can significantly affect the quality and reliability of analyses, so it's essential to
handle them appropriately.

Visual Exploration:

❖ We utilized Plotly, a powerful visualization library, to create interactive visualizations for deeper insights into the data.
❖ Time series plots were generated to visualize the variation of the 'HT R Phase Current'
over time, providing insights into trends, patterns, and potential anomalies.
❖ Histograms with specified bin sizes were created to explore the distribution of 'HT R
Phase Current' values and identify any potential outliers.
❖ Box plots were used to detect outliers and understand the spread of the data.
❖ Scatter plots were created to examine the distribution of the current values over time.

​ The presence of numerous missing values in the original dataset results in gaps, visually
represented as white spaces, in the bar graph.

​ Axes: The X-axis represents the “Timestamp”, and the Y-axis represents the “HT R Phase Current”.
The Y-axis ranges from 0 to 100, indicating the current’s magnitude at different timestamps.
​ Data Representation: The data is represented by blue vertical lines. Each line corresponds to the
HT R Phase current at a specific timestamp.
​ Variations: There are noticeable fluctuations in the current over time. Some peaks are reaching
close to 100 on the Y-axis, indicating high current instances.

1. Peak near 0: There’s a significant peak at 0 on the X-axis, indicating that a current level of
0 has the highest frequency. This means that the most common current level is 0.
2. Decline in Frequency: After the peak at 0, there’s a rapid decline in frequency as the
current increases, with minor peaks around 80. This indicates that higher current levels
occur less frequently.

In short, the HT R Phase Current is close to zero most of the time, but surges occur regularly; away from the spike at zero the distribution is fairly flat, with a roughly normal-shaped concentration of values near 73 A.

Overall, our EDA process facilitated a comprehensive understanding of the dataset's
characteristics, allowing us to identify key patterns, anomalies, and areas requiring further
attention. This foundational analysis serves as the basis for more advanced modeling and
analysis techniques, guiding subsequent steps in the data analysis pipeline.

B. Identify a 2-week period, where the data seems to be relatively unstable. For example, see the
image below: there are wild fluctuations and missing observations.

​ Identification of Unstable Periods:


● Using rolling statistics, specifically rolling standard deviation, we identified periods
of data instability characterized by high variance.
● Unstable periods, marked by wild fluctuations and missing observations, were
pinpointed within the dataset.

● Missing values were imputed; rather than interpolation or mean imputation, the previous day's value was used here.

Unstable period starts on: 2019-06-17 12:45:00

Unstable period ends on: 2019-07-01 12:45:00

The same period is obtained after imputing the missing values in the original dataset (e5-htr-current.csv) and saving the result to e5-htr-current-filled.csv:

Unstable period starts on: 2019-06-17 12:45:00

Unstable period ends on: 2019-07-01 12:45:00

First Code Snippet:

● This code snippet loads a CSV file containing time-series data (e5-htr-current-filled.csv) into a pandas DataFrame.
● It converts the 'Timestamp' column to datetime format to facilitate time-based analysis.
● Then, it calculates the rolling standard deviation with a window of 14 days to identify
periods of instability.
● The period with the highest variation in rolling standard deviation is determined, indicating
potential instability.
● Using Plotly Express, it generates a time series plot of the 'HT R Phase Current' data,
highlighting the unstable period with a shaded rectangular region.
● Finally, it displays the plot.

Second Code Snippet:

● Similar to the first snippet, this code also loads the same CSV file but does not preprocess
it for missing values.
● It performs the same operations as the first snippet, calculating the rolling standard
deviation and identifying the unstable period.
● However, before plotting, this code snippet visualizes the time series data after imputing
missing values with the previous day's values.
● It then uses Plotly Express to create a time series plot of the 'HT R Phase Current' data,
highlighting the same unstable period identified earlier.
● Finally, it displays the plot.
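As a rough illustration, a sketch along the lines of these snippets might look as follows; the column name, the 14-day window choice, and the way the window endpoints are derived from the rolling-standard-deviation peak are assumptions, not the report's exact code.

import pandas as pd
import plotly.express as px

df = pd.read_csv("e5-htr-current-filled.csv")
df["Timestamp"] = pd.to_datetime(df["Timestamp"])
df = df.set_index("Timestamp").sort_index()

col = "HT R Phase Current"
rolling_std = df[col].rolling("14D").std()     # 14-day rolling standard deviation

# Treat the timestamp where the rolling std peaks as the end of the unstable window.
end = rolling_std.idxmax()
start = end - pd.Timedelta(days=14)
print("Unstable period starts on:", start)
print("Unstable period ends on:", end)

fig = px.line(df.reset_index(), x="Timestamp", y=col,
              title="HT R Phase Current with unstable period highlighted")
fig.add_vrect(x0=start, x1=end, fillcolor="red", opacity=0.2, line_width=0)
fig.show()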

Result:

● Both code snippets identify the same 2-week unstable period, which starts on June 17, 2019, at
12:45:00, and ends on July 1, 2019, at 12:45:00.

C. In your report, clearly state the start and end dates you have chosen for such detailed
analysis.

Unstable period starts on: 2019-06-17 12:45:00

Unstable period ends on: 2019-07-01 12:45:00

D. On this data segment, implement at least 3 methods to remove outliers (if any) / smoothen
the data / impute missing data, thereby converting the relatively bad data into good data.

Here are three methods each for removing outliers, smoothening the data, and imputing missing
data:

Removing Outliers:
​ Interquartile Range (IQR) Method:
● Calculate the interquartile range (IQR) of the data.
● Identify outliers as data points that fall below Q1 - 1.5 * IQR or above Q3 + 1.5 *
IQR.
● Remove outliers from the dataset.
​ Z-Score Method:
● Calculate the z-score for each data point, representing the number of standard
deviations away from the mean.
● Set a threshold for z-scores (e.g., |z-score| > 3) to identify outliers.
● Remove data points with z-scores exceeding the threshold.
​ Modified Z-Score Method:
● Calculate the modified z-score, which is a robust version of the z-score that is less
sensitive to outliers.
● Set a threshold for modified z-scores (e.g., |modified z-score| > 3.5) to identify
outliers.
● Remove data points with modified z-scores exceeding the threshold.
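A sketch of these three outlier filters, applied to a pandas Series s that is assumed to hold the 'HT R Phase Current' values of the chosen segment:

import pandas as pd

def remove_outliers_iqr(s: pd.Series) -> pd.Series:
    # Keep points inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s[(s >= q1 - 1.5 * iqr) & (s <= q3 + 1.5 * iqr)]

def remove_outliers_zscore(s: pd.Series, threshold: float = 3.0) -> pd.Series:
    # Keep points whose |z-score| does not exceed the threshold.
    z = (s - s.mean()) / s.std()
    return s[z.abs() <= threshold]

def remove_outliers_modified_zscore(s: pd.Series, threshold: float = 3.5) -> pd.Series:
    # Robust variant based on the median and the median absolute deviation (MAD).
    median = s.median()
    mad = (s - median).abs().median()
    modified_z = 0.6745 * (s - median) / mad
    return s[modified_z.abs() <= threshold]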

Smoothening the Data:


​ Moving Average (MA) Method:
● Compute the moving average of the data over a specified window size.
● Replace each data point with the average of the values within the window centered
at that point.
● Adjust the window size based on the desired level of smoothing.
​ Exponential Smoothing Method:
● Use exponential smoothing to assign exponentially decreasing weights to past
observations.
● Replace each data point with a weighted average of past observations, with more
recent observations receiving higher weights.
● Adjust the smoothing parameter (e.g., smoothing factor) to control the level of
smoothing.
​ Loess Smoothing Method:

● Apply local regression (LOESS - LOcally WEighted Scatterplot Smoothing) to fit a
curve to the data.
● Use weighted regression to estimate the value of each data point based on
neighboring points.
● Adjust the span parameter to control the degree of smoothing.
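The three smoothing approaches could be sketched as follows for the same Series s; the window, alpha, and frac values are illustrative, and LOESS here is borrowed from statsmodels' lowess implementation.

import numpy as np
import pandas as pd
from statsmodels.nonparametric.smoothers_lowess import lowess

def moving_average(s: pd.Series, window: int = 5) -> pd.Series:
    # Centred moving average; a larger window gives stronger smoothing.
    return s.rolling(window=window, center=True, min_periods=1).mean()

def exponential_smoothing(s: pd.Series, alpha: float = 0.3) -> pd.Series:
    # Exponentially weighted average; a smaller alpha gives stronger smoothing.
    return s.ewm(alpha=alpha).mean()

def loess_smoothing(s: pd.Series, frac: float = 0.1) -> pd.Series:
    # LOESS fit; 'frac' plays the role of the span parameter.
    fitted = lowess(s.values, np.arange(len(s)), frac=frac, return_sorted=False)
    return pd.Series(fitted, index=s.index)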

Imputing Missing Data:


​ Forward Fill (FFill) Method:
● Fill missing values with the most recent known value in the dataset.
● When encountering a missing value, use the value of the previous non-missing
data point to fill in the missing value.
​ Backward Fill (BFill) Method:
● Fill missing values with the next known value in the dataset.
● When encountering a missing value, use the value of the next non-missing data
point to fill in the missing value.
​ Mean Imputation Method:
● Calculate the mean of the dataset (excluding missing values).
● Fill missing values with the mean value of the dataset.
● This method assumes that missing values are missing completely at random and
replaces them with the average value of the available data.
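These three imputation options map directly onto pandas one-liners; a sketch, using the column names referred to elsewhere in the report:

import pandas as pd

df = pd.read_csv("e5-htr-current.csv", parse_dates=["Timestamp"])
s = df["HT R Phase Current"]

forward_filled = s.ffill()           # last known value carried forward
backward_filled = s.bfill()          # next known value carried backward
mean_imputed = s.fillna(s.mean())    # gaps replaced with the overall mean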

E. Can you think of a method that uses other good data – elsewhere in the data set – to
guide you into treating the bad region that you have identified? That is, can you use the
global data trend information to make local changes?

One method that can utilize information from other good data elsewhere in the dataset to guide
treatment of the identified bad region is piecewise regression.

Piecewise regression involves fitting multiple linear regression models to different segments of
the data, allowing for different slopes and intercepts in each segment. By identifying stable
segments of the data outside the bad region and fitting regression models to those segments,
you can use the trend information from the good data to estimate the expected values within the
bad region.
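A minimal sketch of this idea, assuming that the two-week segments immediately before and after the flagged window can serve as "good" data and that a simple linear trend per segment is adequate:

import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("e5-htr-current-filled.csv", parse_dates=["Timestamp"]).set_index("Timestamp")
col = "HT R Phase Current"
start, end = pd.Timestamp("2019-06-17 12:45:00"), pd.Timestamp("2019-07-01 12:45:00")

# Use stable segments just before and after the bad region as training data.
before = df.loc[start - pd.Timedelta(days=14): start, col]
after = df.loc[end: end + pd.Timedelta(days=14), col]
good = pd.concat([before, after]).dropna()

X_good = good.index.astype("int64").to_numpy().reshape(-1, 1)   # timestamps as numeric features
model = LinearRegression().fit(X_good, good.values)

# Estimate the bad region from the surrounding trend.
bad_index = df.loc[start:end].index
X_bad = bad_index.astype("int64").to_numpy().reshape(-1, 1)
df["piecewise_estimate"] = pd.Series(model.predict(X_bad), index=bad_index)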

F. For each of the methods implemented, describe the steps, and clearly show your final
results (visually + using statistical measures). Adequately justify your decisions!

The steps, final results (plots and statistical summaries), and the justification for each method are documented in detail in the sections above.

Problem 2: Outliers, missing values, scaling / normalization, correlation analysis, VIF analysis, PCA analysis

A. Perform EDA on the data file e5-Run2-June22-subset-100-cols.csv to understand the overall nature / quality of the data.

Exploratory Data Analysis (EDA) is a crucial step in understanding the dataset's characteristics and identifying patterns or anomalies. Here is how the EDA process was started for this dataset:

​ Load the Data: Begin by loading the dataset into your environment using pandas'
read_csv function.
​ Initial Exploration: Take a quick look at the first few rows of the dataset using the head()
function to understand its structure and contents.
​ Summary Statistics: Compute summary statistics such as mean, median, standard
deviation, minimum, and maximum values for numerical columns. This gives an overview
of the central tendency and spread of the data.

​ Data Distribution: Visualize the distribution of numerical variables using histograms. This
helps in understanding the data's spread and identifying potential outliers.
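A compact sketch of these first EDA steps for this file; matplotlib's histogram grid is used here, and every column is coerced to numeric so that the date column 'c1' and any '#REF!' entries simply drop out of the plots.

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("e5-Run2-June22-subset-100-cols.csv")
print(df.head())
print(df.describe())

# Histogram grid over all columns that can be interpreted numerically.
numeric = df.apply(pd.to_numeric, errors="coerce")
numeric.hist(bins=30, figsize=(25, 25))
plt.tight_layout()
plt.show()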

The dataset "e5-Run2-June22-subset-100-cols.csv" likely contains process parameters collected from a chemical plant. These parameters could include various measurements, settings, or conditions related to the operation of the plant's processes. Here's some insight into what this dataset might entail:

​ Process Parameters: These are variables that directly or indirectly influence the chemical
reactions or operations within the plant. Examples could include temperature, pressure,
flow rates, concentrations of reactants or products, and various other physical or
chemical properties.
​ Time Series Data: The data is organized as a time series, with measurements taken at regular (daily) intervals over a period of time.
​ Subset of Columns: The file contains 100 columns, suggesting a relatively rich dataset with a diverse range of process parameters.
​ Purpose of Analysis: The dataset could be intended for various purposes, such as
monitoring plant performance, optimizing processes, detecting anomalies or deviations
from expected behavior, or conducting research and analysis for improvements or
troubleshooting.
​ Data Quality and Integrity: Given the critical nature of process parameters in a chemical
plant, ensuring data quality and integrity is paramount. This includes factors such as
accuracy, precision, consistency, completeness, and reliability of the measurements.


Visualization

Some initial exploratory data analysis (EDA) was carried out on the dataset "e5-Run2-June22-subset-100-cols.csv". A summary of the findings from the corresponding code:

​ Column 'c1': This column was dropped because it contains dates, which are not relevant for analyzing the distributions of the other variables.
​ Handling Missing Values: Both NaN and '#REF!' entries were replaced with NaN to ensure consistent handling of missing data across the dataset.
​ Data Distribution: Histograms were plotted for each column to visualize the distribution of values. Based on visual inspection:
● Columns 'c2', 'c11', 'c14', 'c36', 'c37', and 'c55' appear to be more or less constant, indicating that they may not provide much variability or useful information for further analysis.
● The other columns roughly follow a Gaussian (normal) distribution, suggesting they are suitable for standard statistical analyses and modeling techniques.
​ Missing Columns in the Plot: Columns 'c82', 'c88', 'c96', and 'c97' do not appear in the histogram grid, most likely because of missing values or errors in those columns.

After imputing the missing values with a moving average, the updated plot contains all 99 columns.

B. Process each of the columns to resolve each of the following matters, and implement the
solutions:

a. Do you want to keep or discard the column? What is your basis for these decisions?

​ Quality of data: Assess the quality of data in each column. Columns with a high
proportion of missing values or errors may not provide meaningful insights and could be
candidates for removal.

​ Variability: Check the variability of data in each column. If a column has low variability,
meaning most of the values are the same or very similar, it may not provide much
information and could be considered for removal.
​ Domain knowledge: Consider any domain-specific knowledge that could inform the
decision to keep or discard a column. Some columns may be critical for domain-specific
reasons even if they appear less relevant from a statistical perspective.

b. Are there any outliers / missing values / wrong values in the data? If so, how will you fix
them? Fix them!

Yes. There are many outliers, which are clearly visible in the box plots, and the missing-value check reveals gaps as well (see the accompanying plot). To identify and fix outliers, missing values, or wrong values in the data, we can follow these steps:

​ Identify Outliers:
● Outliers can be detected using statistical methods such as Z-score, IQR
(Interquartile Range), or visual inspection through box plots.
● We can calculate the Z-score for each data point and identify those that fall
beyond a certain threshold (e.g., Z-score greater than 3 or less than -3).
​ Handle Outliers:
● Outliers can be treated in several ways, including:
● Removing them if they are data entry errors.
● Transforming them using techniques like winsorization or log
transformation.

● Imputing them with a more suitable value (e.g., mean, median, or a
value based on domain knowledge).
​ Identify Missing Values:
● Missing values can be identified using functions like isnull() or isna() in
pandas.
● Visual inspection of the data or summary statistics like counts can also
reveal missing values.
​ Handle Missing Values:
● Missing values can be handled by:
● Removing rows or columns with a large proportion of missing values
if they don't contribute significantly to the analysis.
● Imputing missing values using methods like the mean, median, mode, or a moving average; here, a moving average was used.
● For time-series data, missing values can sometimes be interpolated
using methods like linear interpolation or time-series forecasting
techniques.
​ Identify Wrong Values:
● Wrong values can be identified through domain knowledge or by comparing
values against known constraints or valid ranges.
​ Handle Wrong Values:
● Wrong values can be corrected based on domain knowledge or by replacing
them with valid values.

import numpy as np
import pandas as pd

df = pd.read_csv("e5-Run2-June22-subset-100-cols.csv")

# Drop the column 'c1' as it contains dates
df = df.drop('c1', axis=1)
# Replace '#REF!' spreadsheet artefacts with NaN
df = df.replace({'#REF!': np.nan})
# Convert everything to numeric, coercing any remaining bad entries to NaN
df = df.apply(pd.to_numeric, errors='coerce')
# Impute missing values with a moving average (adjust window size as needed)
df = df.fillna(df.rolling(window=3, min_periods=1).mean())

c. Do the columns need to be standardized / normalized? If so, do it!

Yes. Standardization (Z-score normalization) scales each feature so that it has a mean of 0 and a standard deviation of 1. It preserves the shape of the original distribution while centering the data around 0.
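A sketch of this standardization step with scikit-learn, where df is assumed to be the cleaned, fully numeric DataFrame produced above:

import pandas as pd
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns, index=df.index)

print(df_scaled.mean().round(3).head())   # each column now has mean ~0
print(df_scaled.std().round(3).head())    # and standard deviation ~1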

d. Which of the columns are correlated? Create a correlation heat map and state your
observations.

From the heat map, it appears that c2 and c82 are essentially constant, while several groups of columns are strongly correlated with one another.

e. You must deal with the correlated columns. Which columns will you remove from the
consideration of further analysis? (as you may not have enough information about the
domain, take appropriate calls in case of conflicts)

To decide which columns to remove from further analysis due to high correlation, we can
follow these guidelines:

Retain one representative from each cluster: Since highly correlated variables tend to
provide redundant information, we can choose to keep only one representative from each
cluster. This can help reduce multicollinearity issues and simplify the analysis.

Consider domain knowledge: If there are columns that are known to be irrelevant or
redundant based on domain knowledge or business context, those can be removed
regardless of their correlation with other variables.

Prioritize columns with less impact: If there are columns that are less important or have
a weaker impact on the target variable compared to others in the same cluster, those can
be candidates for removal.
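One common way to act on these guidelines is to build the correlation heat map and then drop one column from every pair whose absolute correlation exceeds a threshold. A sketch follows; the 0.9 cutoff is an assumption, and df_scaled is the standardized DataFrame from the previous step.

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

corr = df_scaled.corr()

plt.figure(figsize=(20, 16))
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.title("Correlation heat map")
plt.show()

# Keep the upper triangle only, so each correlated pair is considered once.
threshold = 0.9
upper = corr.abs().where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > threshold).any()]
df_reduced = df_scaled.drop(columns=to_drop)
print("Removed due to high correlation:", to_drop)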

The following 78 columns were retained:

Index(['c2', 'c7', 'c8', 'c9', 'c10', 'c11', 'c12', 'c13', 'c14', 'c15', 'c16', 'c17', 'c18', 'c19', 'c21', 'c22', 'c23',
'c27', 'c29', 'c30', 'c31', 'c33', 'c34', 'c35', 'c36', 'c37', 'c40', 'c41', 'c42', 'c44', 'c45', 'c46', 'c49', 'c51',
'c52', 'c53', 'c55', 'c58', 'c60', 'c61', 'c62', 'c63', 'c64', 'c65', 'c66', 'c67', 'c68', 'c69', 'c70', 'c71', 'c72',
'c73', 'c74', 'c75', 'c76', 'c77', 'c78', 'c79', 'c80', 'c81', 'c82', 'c83', 'c84', 'c85', 'c86', 'c88', 'c89', 'c90',
'c91', 'c92', 'c93', 'c94', 'c95', 'c96', 'c97', 'c98', 'c99', 'c100'],
dtype='object')

The following 21 columns were removed because of high correlation with retained columns:

['c3', 'c4', 'c5', 'c6', 'c20', 'c24', 'c25', 'c26', 'c28', 'c32', 'c38', 'c39', 'c43', 'c47', 'c48', 'c50', 'c54', 'c56',
'c57', 'c59', 'c87']

f. Which of the columns have a multi-collinearity relationship with other columns? Perform VIF analysis to understand this aspect and document your observations.

To identify columns with multicollinearity relationships, you can calculate the Variance
Inflation Factor (VIF) for each column. The VIF measures how much the variance of an
estimated regression coefficient increases if your predictors are correlated. High VIF
values indicate multicollinearity.
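A sketch of the VIF computation with statsmodels; df_reduced is assumed to be the data remaining after the correlation-based pruning, and a VIF above roughly 5-10 is a common rule of thumb for flagging multicollinearity.

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

X = add_constant(df_reduced.dropna())
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
).drop("const")                     # the intercept's VIF is not of interest

print(vif.sort_values(ascending=False).head(10))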

g. Which of these columns will you discard? Decide, by stating the applicable reasons,
and discard those columns.

h. Perform PCA on the data at the two stages i and ii mentioned below, and analyze the
results.

i. After step ‘c’

ii. After step ‘g’

iii. Do not forget to create and understand the elbow diagram in the context of
PCA.

Interpretation and Decision:

● In the elbow diagram, the point where the explained variance ratio begins to plateau indicates the number of principal components that adequately describe the dataset while minimizing information loss.


● For Stage i (after standardization), we observe the elbow point where the explained
variance ratio starts to level off. This suggests that a relatively small number of
principal components can capture most of the variance in the standardized data.
● For Stage ii (after removing correlated columns), we again identify the elbow point
in the explained variance ratio curve. Comparing with Stage i, we assess whether
removing correlated columns has significantly affected the number of principal
components needed to explain the data.
● Based on the elbow points and the amount of variance explained by the principal
components, we decide on the optimal number of principal axes that adequately
describe the dataset while balancing dimensionality reduction and information
preservation.
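A sketch of how the two PCA runs and their elbow diagrams might be produced; df_scaled and df_reduced are the assumed outputs of steps 'c' and 'g' above.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_cumulative_variance(data, label):
    # Fit PCA on all components and plot the cumulative explained variance ratio.
    pca = PCA().fit(data.dropna())
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    plt.plot(range(1, len(cumulative) + 1), cumulative, marker="o", label=label)

plot_cumulative_variance(df_scaled, "Stage i: after standardization")
plot_cumulative_variance(df_reduced, "Stage ii: after removing correlated columns")
plt.xlabel("Number of principal components")
plt.ylabel("Cumulative explained variance ratio")
plt.legend()
plt.show()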

iv. Interpret the diagram and decide how many principal axes can adequately
describe the Dataset.

Interpretation of Principal Component Analysis (PCA) Diagram

The diagram analyzed here is a plot of the cumulative explained variance ratio as a function of the number of principal components. This plot is typically used to determine the optimal number of principal components in PCA; in this case, about 99 percent of the variance is explained by 8-10 principal components.

Key Observations:

1. The cumulative explained variance ratio reaches close to 1 very quickly. This
suggests that a small number of principal components can capture most of
the variance in the data.
2. Each principal axis (or component) in PCA captures a certain amount of the
total variance in the data. The first principal axis captures the most variance,
the second principal axis (orthogonal to the first) captures the second most,
and so on.
3. The ‘elbow’ or the point where the increase in explained variance ratio
becomes less significant is often used as a cutoff point for selecting the
number of principal components. In this case, the ‘elbow’ is not clearly
visible due to the rapid increase to near 1. However, it appears that just one
or a few principal axes can adequately describe the dataset as the
cumulative explained variance ratio reaches close to 1 very quickly.

C. For each of the above tasks, describe the steps, and clearly show your final results (visually +
using statistical measures). Adequately justify all your decisions!

The steps, final results, and justifications for each of these tasks are documented above.

Problem 3: PCA and t-SNE

A. Subject the file mnist_test.csv to PCA analysis

Objective: Perform Principal Component Analysis (PCA) on the mnist_test.csv dataset to reduce
dimensionality and visualize the data.

Steps:

​ Data Preprocessing:
● Load the mnist_test.csv dataset.
● Handle missing values by imputing them with the mean of each column.

● Handle outliers by applying RobustScaler, which scales features using the median and interquartile range and is therefore less sensitive to outliers.
​ PCA Analysis:
● Fit PCA to the preprocessed data to identify principal components.
● Calculate explained variance ratio for each principal component.
● Calculate cumulative explained variance ratio.
​ Visualization:
● Plot the explained variance ratio to understand the contribution of each principal
component.
● Plot the cumulative explained variance ratio to determine the number of
components required to capture a certain amount of variance.
● Plot a scatter plot of the first two principal components to visualize the data in
reduced dimensions.
​ Statistics:
● Calculate the mean and standard deviation of the explained variance ratio to
understand the variability captured by each principal component.
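A sketch of this pipeline is given below; the assumption is that the first column of mnist_test.csv holds the digit label and the remaining 784 columns hold pixel intensities.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler

df = pd.read_csv("mnist_test.csv")
labels, pixels = df.iloc[:, 0], df.iloc[:, 1:]

X = SimpleImputer(strategy="mean").fit_transform(pixels)   # mean imputation
X = RobustScaler().fit_transform(X)                        # median / IQR scaling

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
plt.plot(range(1, len(cumulative) + 1), cumulative)
plt.xlabel("Number of principal components")
plt.ylabel("Cumulative explained variance ratio")
plt.show()

# Scatter plot of the first two principal components, coloured by digit label.
X2 = PCA(n_components=2).fit_transform(X)
plt.scatter(X2[:, 0], X2[:, 1], c=labels, cmap="tab10", s=4)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.colorbar(label="digit")
plt.show()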

B. Create the elbow diagram and make your conclusions

Interpretation: From the plot, we can see that the explained variance ratio sharply decreases
until around 100 components and then flattens out, forming an ‘elbow’ shape. This suggests that
around 100 components can be considered as an optimal number because beyond this point,
adding more components does not significantly increase the explained variance ratio.

C. Create the scatter plot PC2 v/s PC1 and interpret the results

D. Subject the file mnist_test.csv to t-SNE analysis, to map the data to 2 dimensions

E. Visualize the mapped data by creating a scatter plot of these 2 dimensions, and
interpret the results
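A sketch of how tasks D and E might be carried out; the labels are again assumed to sit in the first column, and since t-SNE on the full 10,000-row test set can be slow, a row subsample or a PCA pre-reduction is often applied first.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

df = pd.read_csv("mnist_test.csv")
labels, pixels = df.iloc[:, 0], df.iloc[:, 1:]

# Map the 784-dimensional pixel vectors to 2 dimensions with t-SNE.
embedding = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(pixels.values)

plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, cmap="tab10", s=4)
plt.xlabel("t-SNE Dimension 1")
plt.ylabel("t-SNE Dimension 2")
plt.colorbar(label="digit")
plt.show()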

1. t-SNE Mapping: t-SNE is a dimensionality-reduction technique that is particularly well suited to visualizing high-dimensional datasets. The axes, labeled “t-SNE Dimension 1” and “t-SNE Dimension 2”, are the two embedding dimensions produced by the t-SNE reduction. Each point in this plot represents a data point from the high-dimensional dataset.
2. Clusters: The clusters of blue dots scattered across the plot represent the t-SNE mapped
data points. Each cluster of points might represent a grouping or classification within the
data. The proximity of points can be interpreted as a measure of similarity.
3. Interpretation: The exact interpretation of this plot depends on the nature of the original
data. However, in general, points that are close together in this plot are similar in some
sense in the original high-dimensional dataset, and points that are far apart are dissimilar.

F. Subject the file e5-Run2-June22-subset-100-cols.csv also to t-SNE analysis and visualize and analyze the results. Is any important aspect of the data emerging from this exercise?

To subject the file e5-Run2-June22-subset-100-cols.csv to t-SNE analysis, we first need to preprocess the data, as t-SNE works best with normalized features. Once preprocessed, we apply t-SNE to reduce the dimensionality of the dataset to two dimensions, allowing us to visualize the data in a scatter plot. Let's proceed with these steps:

​ Preprocessing: We'll handle any missing values and standardize or normalize the features
as necessary.
​ t-SNE Analysis: Applying t-SNE to reduce the dimensionality of the dataset to two
dimensions.
​ Visualization and Analysis: Creating a scatter plot of the two-dimensional t-SNE
embeddings and analyzing the results for any emerging patterns or clusters.

The interpretation mirrors the MNIST case above: the axes are the two t-SNE embedding dimensions, each point is one row of the process data, and points that end up close together are similar in the original high-dimensional space. Clusters of points may correspond to groups of similar operating conditions in the plant data, and the presence or absence of such clusters is the main aspect of the data that emerges from this exercise.

By following these steps, we conduct a comprehensive t-SNE analysis of the e5-Run2-June22-subset-100-cols.csv dataset, uncovering important aspects and structures within the data for deeper exploration and understanding.

Major Learnings

​ Exploratory Data Analysis (EDA):


● Understanding the overall nature and quality of data through visualization and
summary statistics.
● Identifying patterns, trends, and anomalies in the dataset.
​ Data Preprocessing:
● Handling missing values, outliers, and incorrect data entries using appropriate
techniques such as imputation, removal, or correction.
● Standardizing or normalizing data to ensure consistency and comparability
between variables.
​ Correlation Analysis:
● Assessing relationships between variables using correlation coefficients and
visualizations like heatmaps.
● Understanding the strength and direction of correlations to inform further analysis.
​ Multicollinearity Detection:
● Identifying multicollinearity among variables using techniques like variance
inflation factor (VIF) analysis.
● Managing multicollinearity to avoid redundancy and improve model performance.
​ Dimensionality Reduction:
● Implementing techniques like Principal Component Analysis (PCA) to reduce the
dimensionality of data while preserving important information.
● Interpreting PCA results to understand the variability explained by principal
components.
​ Decision Making and Justification:
● Making informed decisions about data preprocessing, feature selection, and
analysis based on domain knowledge and data insights.

● Justifying decisions using statistical measures, visualization, and logical
reasoning.
​ Elbow Method:
● Utilizing the elbow method to determine the optimal number of clusters or principal
components in unsupervised learning tasks.
● Understanding the trade-off between explained variance and complexity to select
the appropriate number of components.
​ Visualization Techniques:
● Leveraging various visualization techniques such as scatter plots, histograms, box
plots, and heatmaps to explore and communicate data insights effectively.
● Choosing the most suitable visualization method based on the data characteristics
and analysis objectives.
​ Statistical Measures:
● Calculating descriptive statistics such as mean, median, standard deviation, and
quartiles to summarize data distributions.
● Using statistical tests and metrics to evaluate the effectiveness of data
preprocessing techniques and analysis results.
​ Domain Knowledge Integration:
● Integrating domain knowledge and context into data analysis and decision-making
processes to ensure the relevance and applicability of findings.
● Collaborating with domain experts to interpret results and derive actionable
insights from data analysis.
