Exploratory Data Analysis I
Presented by: Pr. Asmae BENTALEB
Steps of EDA
Data collection
Data preprocessing:
1 - Data understanding and quality assessment
If the predictive models are fed with inaccurate data, the performance and accuracy of the models will be
impacted negatively.
The importance of EDA (2)
Data cleaning and preprocessing are necessary to ensure data integrity.
Data integrity refers to the process of ensuring the accuracy, completeness, consistency, and validity of an
organization's data.
The phase of data preprocessing allows the refining and organization of raw data into a more suitable format that
can be analyzed effectively and reliably.
The integrity of the data used in daily analysis directly affects the validity of our conclusions. Therefore,
spending time on this stage of the pipeline can save us from drawing incorrect conclusions, making poor
decisions, or developing ineffective models.
Exploratory Data Analysis overview
Exploratory Data Analysis (EDA) is a crucial initial step in data science projects.
It involves analyzing and visualizing data to understand its key characteristics, uncover patterns, and identify
relationships between variables. It is also about detecting outliers and missing values along with solutions to
handle them.
EDA is normally carried out as a preliminary step before undertaking more formal statistical analyses or modeling.
Exploratory Data Analysis overview
This varies based on the project and the nature of the data. However, a typical workflow may involve the
following steps:
Data scientists will obtain the data for their business problem from databases where their companies store their data. For unstructured datasets (e.g. logs, raw texts, images, videos, etc.), the data are collected via ETL pipelines prepared by data engineers.
When data scientists do not have the data needed to solve their problems, they can obtain it by scraping websites, purchasing data from data providers, or collecting it themselves through surveys, clickstream data, sensors, or cameras.
II- Data Preprocessing
1 - Data understanding and quality assessment
“Descriptive statistics” is used in the assessment of data quality. These are measures that provide a summary of
the data's central tendency, dispersion, and distribution.
In Python, the pandas library offers a handy method called .describe(), which computes several descriptive
statistics for each column in the DataFrame.
1 - Data understanding and quality assessment
The .describe() method provides count, mean,
standard deviation, minimum, 25th percentile,
median, 75th percentile, and maximum of the
columns. This output can provide vital clues about
potential data quality issues.
◦ Distribution Shape: In EDA, standard deviation is used to understand the shape of the distribution.
◦ If the standard deviation is large relative to the mean, it might suggest that the data is skewed or has outliers.
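As a minimal illustration (the DataFrame here is a made-up toy example, not course data), .describe() can be called directly on a pandas DataFrame:

import pandas as pd

# Hypothetical toy data used only for illustration
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "income": [42000, 58000, 61000, 75000, 52000, 48000],
})

# Prints count, mean, std, min, 25%, 50% (median), 75%, and max for each numeric column
print(df.describe())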
1 - Data understanding and quality assessment:
Visualization
One of the significant aspects of EDA is visual exploration. Visualizing the data can provide insights that might not
be evident from just looking at the data in the form of tables. For instance, histograms can provide a snapshot of
the distribution of the data.
The histogram's shape can provide significant insights into the nature of the data. A roughly symmetrical
histogram might indicate normally distributed data, whereas a skewed histogram could suggest the presence of
outliers.
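A minimal sketch of such a histogram, assuming a pandas DataFrame df with a numeric column named 'age' (the column name is only illustrative):

import matplotlib.pyplot as plt

# Histogram of one numeric column; the bin count is an arbitrary choice
df["age"].plot(kind="hist", bins=20, edgecolor="black")
plt.xlabel("age")
plt.title("Distribution of age")
plt.show()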
Example of Visual exploration of the data
2 - Detection and handling of duplicates
Duplicates are repeated records in the dataset. They can bias the analysis and lead to incorrect conclusions.
Redundant data are data that do not add any new information. They can slow down computations and take up storage space unnecessarily. They are perhaps the most ubiquitous data quality issue.
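A short sketch of detecting and dropping duplicates with pandas (the DataFrame name df is assumed):

# Count fully duplicated rows, then keep only the first occurrence of each row
num_duplicates = df.duplicated().sum()
df = df.drop_duplicates()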
3 - Detection and handling of missing values
Depending on the reason for their existence, missing values can lead to skewed analyses or introduce bias in the ML models. As such, appropriate handling of missing values is a crucial step in maintaining the integrity of the data.
If a certain column has many missing values, i.e., the majority of its data points are NULL, that column can simply be dropped (see the sketch below).
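A hedged sketch of dropping columns dominated by NULL values; the 50% threshold is an illustrative assumption, not a fixed rule:

# Fraction of missing values per column
missing_ratio = df.isnull().mean()
# Drop columns where more than half of the values are missing (threshold is an assumption)
df = df.drop(columns=missing_ratio[missing_ratio > 0.5].index)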
3 - Detection and handling of missing values:
Deletion of rows with missing values
Deletion: This is the simplest method, which involves deleting the records with missing values. This results in a loss
of information. Here's how to do it with pandas:
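A minimal sketch of this deletion approach (the DataFrame name df and the column 'age' are assumed for illustration):

# Drop every row that contains at least one missing value
df = df.dropna()
# Or drop rows only when a specific column is missing
df = df.dropna(subset=["age"])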
3 - Detection and handling of missing values:
replacing the NULL values with mean/median
• Missing values in numeric columns can be filled with the mean (for continuous data) or the median (for ordinal or skewed data), which helps prevent data loss, especially when the amount of missing data is small (see the sketch after this list).
• Mean imputation works well for normally distributed data, while median is preferable for skewed data.
• Predictive imputation involves using machine learning models to predict missing values. While a simple linear
regression might be sufficient in some cases, more sophisticated methods like decision trees, random forests, or
even neural networks might yield better results, depending on the complexity of the data.
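A hedged sketch of mean/median imputation with pandas, followed by a simple predictive (KNN-based) alternative from scikit-learn; the column names 'age' and 'income' are illustrative assumptions:

from sklearn.impute import KNNImputer

# Mean imputation for a roughly normally distributed column
df["age"] = df["age"].fillna(df["age"].mean())
# Median imputation for a skewed column
df["income"] = df["income"].fillna(df["income"].median())

# Alternatively, predictive imputation: each missing value is estimated from the 5 nearest rows
imputer = KNNImputer(n_neighbors=5)
df[["age", "income"]] = imputer.fit_transform(df[["age", "income"]])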
3 - Detection and handling of missing values:
Interpolation
• Interpolation is a technique used to fill missing values based on the values of adjacent datapoints. This technique
is mainly used in the case of time series data or in situations where the missing data points are expected to vary
smoothly or follow a certain trend.
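A minimal sketch, assuming a time-indexed DataFrame df with a numeric column 'temperature' (both names are illustrative):

# Fill gaps by drawing a straight line between the neighbouring observed values
df["temperature"] = df["temperature"].interpolate(method="linear")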
4 - Detection and handling of outliers:
What is an outlier?
Outliers are data points that differ significantly from other observations in the dataset.
They can occur due to reasons like measurement errors, data entry errors, or they could be valid but extreme
observations.
Regardless of the source, outliers can greatly impact the results of the conducted data analysis and predictive models.
◦ Impact on Model Accuracy: Many machine learning algorithms are sensitive to the range and distribution of attribute values. Outliers can mislead the training process, resulting in longer training times and less accurate models.
4 - Detection and handling of outliers:
Example of outlier
4 - Detection and handling of outliers:
detection techniques
Outliers can be detected using visualization, by applying mathematical formulas to the dataset, or through statistical approaches.
For example, a boxplot summarizes sample data using the 25th, 50th, and 75th percentiles. One can get insights into the dataset (quartiles, median, and outliers) just by looking at its boxplot. (In the examples below, 'bmi' stands for body mass index.)
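A brief sketch of such a boxplot check, assuming a DataFrame df_diabetics (e.g. the scikit-learn diabetes data loaded into pandas) with a 'bmi' column:

import matplotlib.pyplot as plt

# The box spans Q1 to Q3 with the median inside; points beyond the whiskers are potential outliers
plt.boxplot(df_diabetics["bmi"])
plt.ylabel("bmi")
plt.show()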
4 - Detection and handling of outliers:
Visualizing and Removing Outliers Using Scatter Plot
To remove the outliers, we identify the interval containing them from the scatter plot; in this example, the outliers are located where 'bmi' is greater than 0.12 and 'bp' is less than 0.8. The output provides the positional indices of the outliers in the DataFrame.
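A hedged sketch of this approach; it assumes df_diabetics is a pandas DataFrame with 'bmi' and 'bp' columns (for example, the scikit-learn diabetes dataset):

import numpy as np
import matplotlib.pyplot as plt

# Visualize 'bmi' against 'bp' to spot points that lie far from the rest
plt.scatter(df_diabetics["bmi"], df_diabetics["bp"])
plt.xlabel("bmi")
plt.ylabel("bp")
plt.show()

# Positional indices of the rows matching the outlier condition above
outlier_positions = np.where((df_diabetics["bmi"] > 0.12) & (df_diabetics["bp"] < 0.8))[0]
# Drop those rows to obtain a cleaned DataFrame
df_clean = df_diabetics.drop(df_diabetics.index[outlier_positions])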
4 - Detection and handling of outliers:
Removing Outliers Using the IQR (Interquartile Range)
The IQR method first computes the first quartile (Q1) and third quartile (Q3) using the midpoint method, then calculates the IQR as the difference between Q3 and Q1, providing a measure of the spread of the middle 50% of the data. Observations falling below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are then flagged as outliers.
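A hedged sketch of this IQR computation, again assuming df_diabetics with a 'bmi' column; the 1.5 × IQR fences are the conventional choice:

import numpy as np

# On older NumPy versions the keyword is interpolation="midpoint" instead of method
Q1 = np.percentile(df_diabetics["bmi"], 25, method="midpoint")  # first quartile
Q3 = np.percentile(df_diabetics["bmi"], 75, method="midpoint")  # third quartile
IQR = Q3 - Q1                                                   # spread of the middle 50%

# Keep only rows inside the conventional 1.5 * IQR fences
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
df_no_outliers = df_diabetics[(df_diabetics["bmi"] >= lower) & (df_diabetics["bmi"] <= upper)]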