
Exploratory Data Analysis I
Presented by: Pr. Asmae BENTALEB

Academic Year 2024-2025


Plan
 The importance of EDA
 Overview of EDA

 Steps of EDA
 Data collection

 Data preprocessing:
1 - Data understanding and quality assessment

2 - Detection and handling of duplicates

3 - Detection and handling of missing values

4 - Detection and handling of outliers


 Feature Engineering
The importance of EDA (1)
Real-world data (collected by companies and organizations) is messy: it is filled with inaccuracies, inconsistencies, and missing entries.

 If predictive models are fed with inaccurate data, their performance and accuracy will be negatively impacted.
The importance of EDA (2)
Data cleaning and preprocessing are necessary to ensure data integrity.

Data integrity refers to the process of ensuring the accuracy, completeness, consistency, and validity of an
organization's data.

The data preprocessing phase refines and organizes raw data into a format that can be analyzed effectively and reliably.

 The integrity of the data used in daily analysis directly affects the validity of our conclusions. Therefore,
spending time on this stage of the pipeline can save us from drawing incorrect conclusions, making poor
decisions, or developing ineffective models.
Exploratory Data Analysis overview
Exploratory Data Analysis (EDA) is a crucial initial step in data science projects.

 It involves analyzing and visualizing data to understand its key characteristics, uncover patterns, and identify relationships between variables. It also covers detecting outliers and missing values, along with strategies to handle them.

EDA is normally carried out as a preliminary step before undertaking more formal statistical analyses or modeling.
Exploratory Data Analysis overview
The EDA workflow varies based on the project and the nature of the data. However, a typical workflow may involve the following steps:

Fig 2: Exploratory Data Analysis


I- Data Collection
Data collection Overview

 Data scientists usually obtain the data for their business problem from the databases where their companies store it. Unstructured datasets (e.g., logs, raw texts, images, videos) are collected via ETL pipelines prepared by data engineers.

 These datasets either reside in a data lake or in a database.

 When data scientists do not have the data needed to solve their problem, they can obtain it by scraping websites, purchasing data from data providers, or collecting it themselves from surveys, clickstream data, sensors, or cameras.
II- Data Preprocessing
1 - Data understanding and quality assessment

 Descriptive statistics are used in the assessment of data quality. These are measures that provide a summary of the data's central tendency, dispersion, and distribution.

 In Python, the pandas library offers a handy method called .describe(), which computes several descriptive
statistics for each column in the DataFrame.

The .describe() method provides the count, mean, standard deviation, minimum, quartiles, and maximum of each numeric column. This output can provide vital clues about potential data quality issues.
1 - Data understanding and quality assessment
The .describe() method provides count, mean, standard deviation, minimum, 25th percentile, median, 75th percentile, and maximum of the columns. This output can provide vital clues about potential data quality issues.

For example, a maximum value that's dramatically larger than the 75th percentile might indicate the presence of outliers.
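As a minimal, hedged sketch (the columns and values below are made up for illustration and are not from any course dataset), .describe() can be called directly on a DataFrame:

import pandas as pd

# Illustrative data; any numeric DataFrame works the same way
df = pd.DataFrame({"age": [25, 32, 47, 51, 62, 23, 44],
                   "income": [30000, 42000, 55000, 61000, 250000, 28000, 50000]})

# count, mean, std, min, 25%, 50%, 75%, max for each numeric column
print(df.describe())
# Here the 'income' maximum (250000) is far above its 75th percentile,
# hinting at a possible outlier.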
1 - Data understanding and quality assessment:
the Mean
Mean: a measure of central tendency in statistics. It is calculated by adding together all the values in a dataset and then dividing by the number of values => the sum of the values / the number of values.
1 - Data understanding and quality assessment:
the Percentile
A percentile is a statistical measure that indicates the value below which a given percentage of observations in a
dataset falls. For example, the 25th percentile (often called the first quartile) is the value below which 25% of the
data points lie. Similarly, the 50th percentile is the median, meaning that half of the data points fall below this
value.

The .describe() method can take other parameters.
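For instance (a hedged sketch; the chosen percentiles are arbitrary), the percentiles parameter controls which percentiles are reported:

import pandas as pd

df = pd.DataFrame({"age": [25, 32, 47, 51, 62, 23, 44]})  # illustrative values

# Request the 10th, 50th and 90th percentiles instead of the default 25/50/75
print(df.describe(percentiles=[0.10, 0.50, 0.90]))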


1 - Data understanding and quality assessment:
the Percentile
 Percentiles divide the data into equal
portions, providing information on how values
are distributed across the dataset.

 The median, for example, represents the 50th percentile, dividing the data into two equal halves. The quartiles divide the data into four equal parts: the first quartile (25th percentile), the median (50th percentile), and the third quartile (75th percentile).
1 - Data understanding and quality assessment:
Standard deviation
Standard deviation is the average amount of variability in the dataset. It tells you, on average, how far each value lies from the mean.
A high standard deviation means that values are generally far from the mean, while a low standard deviation indicates that values are clustered close to the mean.
1 - Data understanding and quality assessment:
Standard deviation
◦ A lower standard deviation indicates that data points are close to the mean, which can suggest consistency or
homogeneity in the data. This might be desirable in situations where uniformity is important, such as in
quality control.

◦ Distribution Shape: In EDA, standard deviation is used to understand the shape of the distribution.

◦ If the standard deviation is large relative to the mean, it might suggest that the data is skewed or has outliers.
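As the last point notes, comparing the standard deviation to the mean can hint at skew or outliers; here is a tiny sketch with made-up numbers:

import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 95])  # illustrative values with one extreme point

mean, std = s.mean(), s.std()
print(mean, std)
# The std is larger than the mean here (std/mean is roughly 1.3),
# which flags the series for a closer look at skew or outliers.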
1 - Data understanding and quality assessment:
Visualization
One of the significant aspects of EDA is visual exploration. Visualizing the data can provide insights that might not
be evident from just looking at the data in the form of tables. For instance, histograms can provide a snapshot of
the distribution of the data.

The histogram's shape can provide significant insights into the nature of the data. A roughly symmetrical
histogram might indicate normally distributed data, whereas a skewed histogram could suggest the presence of
outliers.
Example of Visual exploration of the data
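The original figure is not reproduced here; a hedged sketch of how such a histogram could be produced (randomly generated data for illustration):

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"value": rng.normal(loc=50, scale=10, size=1000)})  # synthetic data

# A roughly symmetric, bell-shaped histogram suggests approximately normal data
df["value"].hist(bins=30)
plt.xlabel("value")
plt.ylabel("frequency")
plt.show()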
2 - Detection and handling of duplicates

Duplicates are repeated records in the dataset. They can bias the analysis and lead to incorrect conclusions.

Redundant data are data that do not add any new information. They can slow down computations and take up unnecessary storage space; therefore, they should be deleted.
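A minimal pandas sketch (illustrative data) for detecting and dropping duplicate rows:

import pandas as pd

df = pd.DataFrame({"id": [1, 2, 2, 3], "score": [10, 20, 20, 30]})  # the two id=2 rows are identical

print(df.duplicated().sum())   # number of fully duplicated rows
df = df.drop_duplicates()      # keep only the first occurrence of each row
print(df)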


3 - Detection and handling of missing values:
Overview
 Missing values arise for a variety of reasons, such as human error during data entry or issues with data collection processes. They are perhaps the most ubiquitous data quality issue.

 Depending on the reason for their existence, missing values can skew analyses or introduce bias into ML models. As such, appropriate handling of missing values is a crucial step in maintaining the integrity of the data analysis. But before handling them, we need to identify them.


3 - Detection and handling of missing values

 Missing values can be detected using the .info() method.
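A short sketch (illustrative DataFrame) of using .info() to spot columns with missing entries:

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 47], "city": ["Fes", "Rabat", None]})

# .info() reports the non-null count per column; a count lower than the number
# of rows signals missing values in that column.
df.info()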


3 - Detection and handling of missing values
 This script prints the count of missing values in each column, offering a first-pass insight into the degree and
distribution of missingness in the dataset.
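The script referred to above is not included in this text; it was probably along these lines (column names are illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 47], "income": [30000, 42000, np.nan]})

# Count of missing values in each column
print(df.isnull().sum())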
3 - Detection and handling of missing values:
Dropping columns with NULL values

If a certain column has many missing values, that is, if the majority of its entries are NULL, the column can simply be dropped.
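A hedged sketch of dropping mostly-NULL columns (the 50% cut-off is an arbitrary choice):

import numpy as np
import pandas as pd

df = pd.DataFrame({"kept": [1, 2, 3, 4],
                   "mostly_missing": [np.nan, np.nan, np.nan, 4]})

# Drop any column in which more than half of the values are missing
threshold = 0.5
df = df.loc[:, df.isnull().mean() <= threshold]
print(df.columns.tolist())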
3 - Detection and handling of missing values:
Deletion of rows with missing values

Deletion: This is the simplest method, which involves deleting the records with missing values. This results in a loss
of information. Here's how to do it with pandas:
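The original snippet is not reproduced here; a sketch of the same idea:

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 47], "income": [30000, 42000, np.nan]})

# Remove every row that contains at least one missing value
df_clean = df.dropna()
print(df_clean)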
3 - Detection and handling of missing values:
replacing the NULL values with mean/median
• Missing values in numeric columns can be filled with the mean (for continuous, roughly symmetric data) or the median (for ordinal or skewed data), helping to prevent data loss, especially when the amount of missing data is small (see the sketch after this list).

• Mean imputation works well for normally distributed data, while the median is preferable for skewed data.

• This method cannot be applied to categorical columns.
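A small sketch of both imputations (illustrative columns):

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 47, 51],
                   "income": [30000, 42000, np.nan, 61000]})

# Mean imputation for a roughly symmetric column, median for a skewed one
df["age"] = df["age"].fillna(df["age"].mean())
df["income"] = df["income"].fillna(df["income"].median())
print(df)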


3 - Detection and handling of missing values:
replacing the NULL values with constant
• Constant Value Imputation: This involves replacing missing values with a constant. This method is useful when
one can make an educated guess about the missing values.
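A small sketch, assuming that a missing value can reasonably be read as "no discount applied":

import numpy as np
import pandas as pd

df = pd.DataFrame({"discount": [0.10, np.nan, 0.25, np.nan]})  # illustrative column

# Educated guess: a missing discount means no discount, so impute 0
df["discount"] = df["discount"].fillna(0)
print(df)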
3 - Detection and handling of missing values:
handling NULL values for categorical variables
• For missing values in categorical columns, there are two common options (both sketched below):
• They can be replaced with the most frequent category (the mode).
• If there are many missing values, it is better to create a new category labeled "Unknown" or "Missing" to retain information about the missingness and avoid bias.
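A short sketch of both options on an illustrative categorical column:

import pandas as pd

df = pd.DataFrame({"city": ["Fes", None, "Rabat", "Fes", None]})

# Option 1: replace missing entries with the most frequent category (the mode)
mode_filled = df["city"].fillna(df["city"].mode()[0])

# Option 2: keep the missingness visible as its own category
unknown_filled = df["city"].fillna("Unknown")

print(mode_filled.tolist())
print(unknown_filled.tolist())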
3 - Detection and handling of missing values:
Forward Fill and Backward Fill
•Forward fill (ffill) and backward fill (bfill) are methods used to fill missing values by carrying forward the last
observed non-missing value (for ffill) or by carrying backward the next observed non-missing value (for bfill).
These methods are particularly useful for time-series data.
3 - Detection and handling of missing values:
Forward Fill and Backward Fill
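The slide's original code is not reproduced here; a hedged sketch on a small illustrative time series:

import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan],
              index=pd.date_range("2024-01-01", periods=5, freq="D"))

print(s.ffill())  # carry the last observed value forward
print(s.bfill())  # carry the next observed value backward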

3 - Detection and handling of missing values:
Multiple Imputation
•Multiple Imputation: The Iterative Imputer, part of the scikit-learn library, uses the Multiple Imputation by
Chained Equations (MICE) algorithm to fill in missing values. It imputes one variable at a time, taking into account
the other variables in the dataset, allowing for more accurate imputation.
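A minimal sketch of the Iterative Imputer (the feature names are illustrative):

import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to expose IterativeImputer)
from sklearn.impute import IterativeImputer

df = pd.DataFrame({"age": [25, np.nan, 47, 51], "bmi": [22.0, 27.5, np.nan, 31.0]})

# Each column with missing values is modelled from the other columns (MICE)
imputer = IterativeImputer(random_state=0)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(df_imputed)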
3 - Detection and handling of missing values:
Other advanced methods(Predictive Imputation)

• Predictive imputation involves using machine learning models to predict missing values. While a simple linear regression might be sufficient in some cases, more sophisticated methods like decision trees, random forests, or even neural networks might yield better results, depending on the complexity of the data.
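One possible (hedged) sketch of predictive imputation, using a random forest to predict a column's missing values from another column; the data is made up:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df = pd.DataFrame({"age": [25, 32, 47, 51, 62, 23],
                   "income": [30000, 42000, np.nan, 61000, np.nan, 28000]})

known = df[df["income"].notna()]
missing = df[df["income"].isna()]

# Train on the rows where 'income' is observed, then predict the missing ones
model = RandomForestRegressor(random_state=0)
model.fit(known[["age"]], known["income"])
df.loc[df["income"].isna(), "income"] = model.predict(missing[["age"]])
print(df)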
3 - Detection and handling of missing values:
Interpolation
• Interpolation is a technique used to fill missing values based on the values of adjacent datapoints. This technique
is mainly used in the case of time series data or in situations where the missing data points are expected to vary
smoothly or follow a certain trend.
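A brief sketch of linear interpolation on an illustrative time series:

import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, np.nan, 16.0],
              index=pd.date_range("2024-01-01", periods=4, freq="D"))

# Fill the gaps assuming values change smoothly between known points
print(s.interpolate(method="linear"))  # -> 10, 12, 14, 16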
4 - Detection and handling of outliers:
What is an outlier?
 Outliers are data points that differ significantly from other observations in the dataset.

 They can occur due to reasons like measurement errors, data entry errors, or they could be valid but extreme observations.

 Regardless of the source, outliers can greatly impact the results of the conducted data analysis and predictive modeling. It is therefore critical to identify and appropriately handle them.


4 - Detection and handling of outliers:
Impact of having outliers
Affect Mean and Standard Deviation: Outliers can significantly skew the mean and inflate the standard deviation, distorting the overall data distribution.

Impact Model Accuracy: Many machine learning algorithms are sensitive to the range and distribution of attribute values. Outliers can mislead the training process, resulting in longer training times and less accurate models.
4 - Detection and handling of outliers:
Example of outlier
4 - Detection and handling of outliers:
detection techniques
 Outliers can be detected using visualization, by applying mathematical formulas to the dataset, or using a statistical approach.

 For example, a boxplot summarizes sample data using the 25th, 50th, and 75th percentiles. One can get insights (quartiles, median, and outliers) into the dataset just by looking at its boxplot.

Scatter plots can also be used for outlier detection.


4-Detection and handling of outliers:
Visualizing and Removing Outliers Using Box Plot
A boxplot summarizes sample data using the 25th, 50th, and 75th percentiles. One can get insights (quartiles, median, and outliers) into the dataset just by looking at its boxplot.

Note: we can see that values above 10 are acting as outliers.
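The slide's figure is not reproduced here; a hedged reconstruction with made-up values in which the points above 10 stand out as outliers:

import matplotlib.pyplot as plt
import seaborn as sns

values = [1, 2, 2, 3, 3, 4, 5, 5, 6, 7, 12, 15]  # illustrative; 12 and 15 are extreme

sns.boxplot(x=values)
plt.show()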
4 - Detection and handling of outliers:
Visualizing and Removing Outliers Using Box Plot
 This is done by defining a threshold and removing outliers based on that condition. Attached is the boxplot after removing outliers.
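A sketch of that removal step, continuing the made-up values above and keeping 10 as the threshold:

import pandas as pd

df = pd.DataFrame({"value": [1, 2, 2, 3, 3, 4, 5, 5, 6, 7, 12, 15]})

threshold = 10
df_no_outliers = df[df["value"] <= threshold]
print(df_no_outliers["value"].max())  # 7, the extreme points are gone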
4 - Detection and handling of outliers:
Visualizing and Removing Outliers Using Scatter Plot

 To plot a scatter plot, one requires two variables that are somehow related to each other, e.g., blood pressure and body mass index.
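A hedged sketch, assuming df_diabetics is built from scikit-learn's diabetes dataset, which provides the 'bmi' and 'bp' columns used on the next slides:

import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes

df_diabetics = load_diabetes(as_frame=True).frame  # assumed source of df_diabetics

plt.scatter(df_diabetics["bmi"], df_diabetics["bp"])
plt.xlabel("bmi")
plt.ylabel("bp")
plt.show()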
4 - Detection and handling of outliers:
Visualizing and Removing Outliers Using Scatter Plot
To remove the outliers, we detect the interval containing them from the scatter plot; in this example, the outliers are located in the interval where ‘bmi’ is greater than 0.12 and ‘bp’ is less than 0.8. The output provides the row and column indices of the outlier positions in the DataFrame.
4 - Detection and handling of outliers:
Visualizing and Removing Outliers Using Scatter Plot

 The np.where() function from NumPy was used to identify the indices of the points falling in the specified interval.
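A sketch of that step, again assuming df_diabetics comes from scikit-learn's diabetes dataset and reusing the interval quoted above:

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_diabetes

df_diabetics = load_diabetes(as_frame=True).frame  # assumed source of df_diabetics

# Row indices of the points falling in the interval flagged as outliers
outlier_rows = np.where((df_diabetics["bmi"] > 0.12) & (df_diabetics["bp"] < 0.8))[0]
df_clean = df_diabetics.drop(df_diabetics.index[outlier_rows])

# Scatter plot after removing the outliers
plt.scatter(df_clean["bmi"], df_clean["bp"])
plt.xlabel("bmi")
plt.ylabel("bp")
plt.show()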

Here is the scatter plot after removing the outliers:
4 - Detection and handling of outliers:
Handling outliers using z score
 Z-score is also called a standard score. This value helps to understand how far a data point is from the mean. After setting a threshold value, one can use the z-scores of data points to define outliers.

Z-score = (data_point - mean) / std_deviation


4 - Detection and handling of outliers:
Handling outliers with z score example
 Example of calculating the z-score for the column “age”.
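A sketch of that calculation, assuming df_diabetics as above (the diabetes dataset also has an 'age' column):

from sklearn.datasets import load_diabetes

df_diabetics = load_diabetes(as_frame=True).frame  # assumed source of df_diabetics

# Z-score = (value - mean) / standard deviation
z_age = (df_diabetics["age"] - df_diabetics["age"].mean()) / df_diabetics["age"].std()
print(z_age.head())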
4 - Detection and handling of outliers:
Outliers removal using z-score
 To define outliers, a threshold value is
chosen which is generally 2 or 3.
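A sketch of the removal step, with 3 as the (arbitrary but common) threshold and df_diabetics assumed as above:

from sklearn.datasets import load_diabetes

df_diabetics = load_diabetes(as_frame=True).frame  # assumed source of df_diabetics

z_age = (df_diabetics["age"] - df_diabetics["age"].mean()) / df_diabetics["age"].std()

threshold = 3
df_filtered = df_diabetics[z_age.abs() <= threshold]  # keep only rows with |z| <= 3
print(len(df_diabetics), len(df_filtered))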
4 - Detection and handling of outliers:
Handling outliers using IQR
 IQR (Interquartile Range): the interquartile range approach to finding outliers is among the most commonly used and trusted approaches in research:

IQR = Quartile3 – Quartile1


4 - Detection and handling of outliers:
Handling outliers using IQR
 Let’s take the example of calculating the interquartile range (IQR) for the ‘bmi’ column in the DataFrame df_diabetics.

 We first compute the first quartile (Q1) and third quartile (Q3) using the midpoint method, then calculate the IQR as the difference between Q3 and Q1, providing a measure of the spread of the middle 50% of the data in the ‘bmi’ column.
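A sketch of that computation, assuming df_diabetics as above; the 1.5 × IQR rule added at the end is a common (but not the only) way to turn the IQR into outlier bounds:

from sklearn.datasets import load_diabetes

df_diabetics = load_diabetes(as_frame=True).frame  # assumed source of df_diabetics

# Quartiles of 'bmi' using the midpoint method
q1 = df_diabetics["bmi"].quantile(0.25, interpolation="midpoint")
q3 = df_diabetics["bmi"].quantile(0.75, interpolation="midpoint")
iqr = q3 - q1
print(iqr)

# Flag points lying more than 1.5 * IQR outside the quartiles
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df_diabetics[(df_diabetics["bmi"] < lower) | (df_diabetics["bmi"] > upper)]
print(len(outliers))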
