
Exploratory Data Analysis I
Presented by: Pr. Asmae BENTALEB

Academic Year 2024-2025


Plan
 The importance of EDA
 Overview of EDA

 Steps of EDA
 Data collection

 Data preprocessing:
1 - Data understanding and quality assessment

2 - Detection and handling of duplicates

3 - Detection and handling of missing values

4 - Detection and handling of outliers


 Feature Engineering
The importance of EDA (1)
Real-world data (collected by companies and organizations) is messy: it is filled with inaccuracies, inconsistencies, and missing entries.

 If predictive models are fed with inaccurate data, their performance and accuracy will be negatively impacted.
The importance of EDA (2)
Data cleaning and preprocessing are necessary to ensure data integrity.

Data integrity refers to the process of ensuring the accuracy, completeness, consistency, and validity of an
organization's data.

The data preprocessing phase refines and organizes raw data into a format that can be analyzed effectively and reliably.

 The integrity of the data used in daily analysis directly affects the validity of our conclusions. Therefore,
spending time on this stage of the pipeline can save us from drawing incorrect conclusions, making poor
decisions, or developing ineffective models.
Exploratory Data Analysis overview
Exploratory Data Analysis (EDA) is a crucial initial step in data science projects.

 It involves analyzing and visualizing data to understand its key characteristics, uncover patterns, and identify relationships between variables. It also covers detecting outliers and missing values, along with strategies to handle them.

EDA is normally carried out as a preliminary step before undertaking more formal statistical analyses or modeling.
Exploratory Data Analysis overview
The EDA workflow varies based on the project and the nature of the data. However, a typical workflow may involve the following steps:

Fig 2: Exploratory Data Analysis


I- Data Collection
Data collection Overview

 Data scientists usually obtain the data for their business problem from the databases where their companies store it. Unstructured datasets (e.g., logs, raw texts, images, videos) are collected via ETL pipelines prepared by data engineers.

 These datasets either reside in a data lake or in a database.

 When data scientists do not have the data needed to solve their problem, they can obtain it by scraping websites, purchasing data from data providers, or collecting it themselves from surveys, clickstream data, sensors, or cameras.
II- Data Preprocessing
1 - Data understanding and quality assessment

 Descriptive statistics are used in the assessment of data quality. These are measures that provide a summary of the data's central tendency, dispersion, and distribution.

 In Python, the pandas library offers a handy method called .describe(), which computes several descriptive
statistics for each column in the DataFrame.

The .describe() method provides the count, mean, standard deviation, minimum, quartiles, and maximum of each numeric column. This output can provide vital clues about potential data quality issues.
1 - Data understanding and quality assessment
The .describe() method provides count, mean, standard deviation, minimum, 25th percentile, median, 75th percentile, and maximum of the columns. This output can provide vital clues about potential data quality issues.

For example, a maximum value that's dramatically larger than the 75th percentile might indicate the presence of outliers.
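As a minimal, hedged sketch (the columns and values below are made up for illustration and are not from any course dataset), .describe() can be called directly on a DataFrame:

import pandas as pd

# Illustrative data; any numeric DataFrame works the same way
df = pd.DataFrame({"age": [25, 32, 47, 51, 62, 23, 44],
                   "income": [30000, 42000, 55000, 61000, 250000, 28000, 50000]})

# count, mean, std, min, 25%, 50%, 75%, max for each numeric column
print(df.describe())
# Here the 'income' maximum (250000) is far above its 75th percentile,
# hinting at a possible outlier.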
1 - Data understanding and quality assessment:
the Mean
Mean: a measure of central tendency in statistics. It is calculated by adding together all the values in a dataset and then dividing by the number of values => the sum of the values / the number of values.
1 - Data understanding and quality assessment:
the Percentile
A percentile is a statistical measure that indicates the value below which a given percentage of observations in a
dataset falls. For example, the 25th percentile (often called the first quartile) is the value below which 25% of the
data points lie. Similarly, the 50th percentile is the median, meaning that half of the data points fall below this
value.

The .describe() method can take other parameters.
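For instance (a hedged sketch; the chosen percentiles are arbitrary), the percentiles parameter controls which percentiles are reported:

import pandas as pd

df = pd.DataFrame({"age": [25, 32, 47, 51, 62, 23, 44]})  # illustrative values

# Request the 10th, 50th and 90th percentiles instead of the default 25/50/75
print(df.describe(percentiles=[0.10, 0.50, 0.90]))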


1 - Data understanding and quality assessment:
the Percentile
 Percentiles divide the data into equal
portions, providing information on how values
are distributed across the dataset.

 The median, for example, represents the 50th percentile, dividing the data into two equal halves. The quartiles divide the data into four equal parts: the first quartile (25th percentile), the median (50th percentile), and the third quartile (75th percentile).
1 - Data understanding and quality assessment:
Standard deviation
Standard deviation is the average amount of variability in the dataset. It tells you, on average, how far each value lies from the mean.
A high standard deviation means that values are generally far from the mean, while a low standard deviation indicates that values are clustered close to the mean.
1 - Data understanding and quality assessment:
Standard deviation
◦ A lower standard deviation indicates that data points are close to the mean, which can suggest consistency or
homogeneity in the data. This might be desirable in situations where uniformity is important, such as in
quality control.

◦ Distribution Shape: In EDA, standard deviation is used to understand the shape of the distribution.

◦ If the standard deviation is large relative to the mean, it might suggest that the data is skewed or has outliers.
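As the last point notes, comparing the standard deviation to the mean can hint at skew or outliers; here is a tiny sketch with made-up numbers:

import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 95])  # illustrative values with one extreme point

mean, std = s.mean(), s.std()
print(mean, std)
# The std is larger than the mean here (std/mean is roughly 1.3),
# which flags the series for a closer look at skew or outliers.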
1 - Data understanding and quality assessment:
Visualization
One of the significant aspects of EDA is visual exploration. Visualizing the data can provide insights that might not
be evident from just looking at the data in the form of tables. For instance, histograms can provide a snapshot of
the distribution of the data.

The histogram's shape can provide significant insights into the nature of the data. A roughly symmetrical
histogram might indicate normally distributed data, whereas a skewed histogram could suggest the presence of
outliers.
Example of Visual exploration of the data
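The original figure is not reproduced here; a hedged sketch of how such a histogram could be produced (randomly generated data for illustration):

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"value": rng.normal(loc=50, scale=10, size=1000)})  # synthetic data

# A roughly symmetric, bell-shaped histogram suggests approximately normal data
df["value"].hist(bins=30)
plt.xlabel("value")
plt.ylabel("frequency")
plt.show()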
2 - Detection and handling of duplicates

Duplicates are repeated records in the dataset. They can bias the analysis and lead to incorrect conclusions.

Redundant data are data that do not add any new information. They can slow down computations and take up unnecessary storage space; therefore, they should be deleted.
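A minimal pandas sketch (illustrative data) for detecting and dropping duplicate rows:

import pandas as pd

df = pd.DataFrame({"id": [1, 2, 2, 3], "score": [10, 20, 20, 30]})  # the two id=2 rows are identical

print(df.duplicated().sum())   # number of fully duplicated rows
df = df.drop_duplicates()      # keep only the first occurrence of each row
print(df)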


3 - Detection and handling of missing values:
Overview
 Missing values arise for a variety of reasons, such as human error during data entry or issues with data collection processes. They are perhaps the most ubiquitous data quality issue.

 Depending on the reason for their existence, missing values can skew analyses or introduce bias into ML models. As such, appropriate handling of missing values is a crucial step in maintaining the integrity of the data analysis. But before handling them, we need to identify them.


3 - Detection and handling of missing values

 Missing values can be detected using the .info() method.
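A short sketch (illustrative DataFrame) of using .info() to spot columns with missing entries:

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 47], "city": ["Fes", "Rabat", None]})

# .info() reports the non-null count per column; a count lower than the number
# of rows signals missing values in that column.
df.info()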


3 - Detection and handling of missing values
 This script prints the count of missing values in each column, offering a first-pass insight into the degree and
distribution of missingness in the dataset.
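The script referred to above is not included in this text; it was probably along these lines (column names are illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 47], "income": [30000, 42000, np.nan]})

# Count of missing values in each column
print(df.isnull().sum())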
3 - Detection and handling of missing values:
Dropping columns with NULL values

If a certain column has many missing values, that is, if the majority of its entries are NULL, the column can simply be dropped.
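A hedged sketch of dropping mostly-NULL columns (the 50% cut-off is an arbitrary choice):

import numpy as np
import pandas as pd

df = pd.DataFrame({"kept": [1, 2, 3, 4],
                   "mostly_missing": [np.nan, np.nan, np.nan, 4]})

# Drop any column in which more than half of the values are missing
threshold = 0.5
df = df.loc[:, df.isnull().mean() <= threshold]
print(df.columns.tolist())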
3 - Detection and handling of missing values:
Deletion of rows with missing values

Deletion: This is the simplest method, which involves deleting the records with missing values. This results in a loss
of information. Here's how to do it with pandas:
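The original snippet is not reproduced here; a sketch of the same idea:

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 47], "income": [30000, 42000, np.nan]})

# Remove every row that contains at least one missing value
df_clean = df.dropna()
print(df_clean)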
3 - Detection and handling of missing values:
replacing the NULL values with mean/median
• Missing values in numeric columns can be filled with the mean (for continuous, roughly symmetric data) or the median (for ordinal or skewed data), helping to prevent data loss, especially when the amount of missing data is small (see the sketch after this list).

• Mean imputation works well for normally distributed data, while the median is preferable for skewed data.

• This method cannot be applied to categorical columns.
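A small sketch of both imputations (illustrative columns):

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 47, 51],
                   "income": [30000, 42000, np.nan, 61000]})

# Mean imputation for a roughly symmetric column, median for a skewed one
df["age"] = df["age"].fillna(df["age"].mean())
df["income"] = df["income"].fillna(df["income"].median())
print(df)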


3 - Detection and handling of missing values:
replacing the NULL values with constant
• Constant Value Imputation: This involves replacing missing values with a constant. This method is useful when
one can make an educated guess about the missing values.
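A small sketch, assuming that a missing value can reasonably be read as "no discount applied":

import numpy as np
import pandas as pd

df = pd.DataFrame({"discount": [0.10, np.nan, 0.25, np.nan]})  # illustrative column

# Educated guess: a missing discount means no discount, so impute 0
df["discount"] = df["discount"].fillna(0)
print(df)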
3 - Detection and handling of missing values:
handling NULL values for categorical variables
• For missing values in categorical columns, there are two common options (both sketched below):
• They can be replaced with the most frequent category (the mode).
• If there are many missing values, it is better to create a new category labeled "Unknown" or "Missing" to retain information about the missingness and avoid bias.
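A short sketch of both options on an illustrative categorical column:

import pandas as pd

df = pd.DataFrame({"city": ["Fes", None, "Rabat", "Fes", None]})

# Option 1: replace missing entries with the most frequent category (the mode)
mode_filled = df["city"].fillna(df["city"].mode()[0])

# Option 2: keep the missingness visible as its own category
unknown_filled = df["city"].fillna("Unknown")

print(mode_filled.tolist())
print(unknown_filled.tolist())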
3 - Detection and handling of missing values:
Forward Fill and Backward Fill
•Forward fill (ffill) and backward fill (bfill) are methods used to fill missing values by carrying forward the last
observed non-missing value (for ffill) or by carrying backward the next observed non-missing value (for bfill).
These methods are particularly useful for time-series data.
3 - Detection and handling of missing values:
Forward Fill and Backward Fill
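The slide's original code is not reproduced here; a hedged sketch on a small illustrative time series:

import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan],
              index=pd.date_range("2024-01-01", periods=5, freq="D"))

print(s.ffill())  # carry the last observed value forward
print(s.bfill())  # carry the next observed value backward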

3 - Detection and handling of missing values:
Multiple Imputation
•Multiple Imputation: The Iterative Imputer, part of the scikit-learn library, uses the Multiple Imputation by
Chained Equations (MICE) algorithm to fill in missing values. It imputes one variable at a time, taking into account
the other variables in the dataset, allowing for more accurate imputation.
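A minimal sketch of the Iterative Imputer (the feature names are illustrative):

import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to expose IterativeImputer)
from sklearn.impute import IterativeImputer

df = pd.DataFrame({"age": [25, np.nan, 47, 51], "bmi": [22.0, 27.5, np.nan, 31.0]})

# Each column with missing values is modelled from the other columns (MICE)
imputer = IterativeImputer(random_state=0)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(df_imputed)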
3 - Detection and handling of missing values:
Other advanced methods(Predictive Imputation)

• Predictive imputation involves using machine learning models to predict missing values. While a simple linear regression might be sufficient in some cases, more sophisticated methods like decision trees, random forests, or even neural networks might yield better results, depending on the complexity of the data.
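One possible (hedged) sketch of predictive imputation, using a random forest to predict a column's missing values from another column; the data is made up:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df = pd.DataFrame({"age": [25, 32, 47, 51, 62, 23],
                   "income": [30000, 42000, np.nan, 61000, np.nan, 28000]})

known = df[df["income"].notna()]
missing = df[df["income"].isna()]

# Train on the rows where 'income' is observed, then predict the missing ones
model = RandomForestRegressor(random_state=0)
model.fit(known[["age"]], known["income"])
df.loc[df["income"].isna(), "income"] = model.predict(missing[["age"]])
print(df)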
3 - Detection and handling of missing values:
Interpolation
• Interpolation is a technique used to fill missing values based on the values of adjacent datapoints. This technique
is mainly used in the case of time series data or in situations where the missing data points are expected to vary
smoothly or follow a certain trend.
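A brief sketch of linear interpolation on an illustrative time series:

import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, np.nan, 16.0],
              index=pd.date_range("2024-01-01", periods=4, freq="D"))

# Fill the gaps assuming values change smoothly between known points
print(s.interpolate(method="linear"))  # -> 10, 12, 14, 16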
4 - Detection and handling of outliers:
What is an outlier?
 Outliers are data points that differ significantly from other observations in the dataset.

 They can occur due to reasons like measurement errors, data entry errors, or they could be valid but extreme observations.

 Regardless of the source, outliers can greatly impact the results of the conducted data analysis and predictive modeling. It is therefore critical to identify and appropriately handle them.


4 - Detection and handling of outliers:
Impact of having outliers
Affect Mean and Standard Deviation: Outliers can significantly skew the mean and inflate the standard deviation, distorting the overall data distribution.

Impact Model Accuracy: Many machine learning algorithms are sensitive to the range and distribution of attribute values. Outliers can mislead the training process, resulting in longer training times and less accurate models.
4 - Detection and handling of outliers:
Example of outlier
4 - Detection and handling of outliers:
detection techniques
 Outliers can be detected using visualization, by applying mathematical formulas to the dataset, or using a statistical approach.

 For example, a boxplot summarizes sample data using the 25th, 50th, and 75th percentiles. One can get insights (quartiles, median, and outliers) into the dataset just by looking at its boxplot.

Scatter plots can also be used for outlier detection.


4-Detection and handling of outliers:
Visualizing and Removing Outliers Using Box Plot
A boxplot summarizes sample data using the 25th, 50th, and 75th percentiles. One can get insights (quartiles, median, and outliers) into the dataset just by looking at its boxplot.

Note: we can see that values above 10 are acting as outliers.
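The slide's figure is not reproduced here; a hedged reconstruction with made-up values in which the points above 10 stand out as outliers:

import matplotlib.pyplot as plt
import seaborn as sns

values = [1, 2, 2, 3, 3, 4, 5, 5, 6, 7, 12, 15]  # illustrative; 12 and 15 are extreme

sns.boxplot(x=values)
plt.show()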
4 - Detection and handling of outliers:
Visualizing and Removing Outliers Using Box Plot
 This is done by defining a threshold and removing outliers based on that condition. Attached is the boxplot after removing outliers.
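A sketch of that removal step, continuing the made-up values above and keeping 10 as the threshold:

import pandas as pd

df = pd.DataFrame({"value": [1, 2, 2, 3, 3, 4, 5, 5, 6, 7, 12, 15]})

threshold = 10
df_no_outliers = df[df["value"] <= threshold]
print(df_no_outliers["value"].max())  # 7, the extreme points are gone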
4 - Detection and handling of outliers:
Visualizing and Removing Outliers Using Scatter Plot

 To plot a scatter plot, one requires two variables that are somehow related to each other, e.g., blood pressure and body mass index.
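A hedged sketch, assuming df_diabetics is built from scikit-learn's diabetes dataset, which provides the 'bmi' and 'bp' columns used on the next slides:

import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes

df_diabetics = load_diabetes(as_frame=True).frame  # assumed source of df_diabetics

plt.scatter(df_diabetics["bmi"], df_diabetics["bp"])
plt.xlabel("bmi")
plt.ylabel("bp")
plt.show()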
4 - Detection and handling of outliers:
Visualizing and Removing Outliers Using Scatter Plot
To remove the outliers, we detect the interval containing them from the scatter plot; in this example, the outliers are located in the interval where ‘bmi’ is greater than 0.12 and ‘bp’ is less than 0.8. The output provides the row and column indices of the outlier positions in the DataFrame.
4 - Detection and handling of outliers:
Visualizing and Removing Outliers Using Scatter Plot

 The np.where() function from NumPy was used to identify the indices of the points falling in the specified interval.
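A sketch of that step, again assuming df_diabetics comes from scikit-learn's diabetes dataset and reusing the interval quoted above:

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_diabetes

df_diabetics = load_diabetes(as_frame=True).frame  # assumed source of df_diabetics

# Row indices of the points falling in the interval flagged as outliers
outlier_rows = np.where((df_diabetics["bmi"] > 0.12) & (df_diabetics["bp"] < 0.8))[0]
df_clean = df_diabetics.drop(df_diabetics.index[outlier_rows])

# Scatter plot after removing the outliers
plt.scatter(df_clean["bmi"], df_clean["bp"])
plt.xlabel("bmi")
plt.ylabel("bp")
plt.show()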

Here is the scatter plot after removing the outliers:
4 - Detection and handling of outliers:
Handling outliers using z score
 Z-score is also called a standard score. This value helps to understand how far a data point is from the mean. After setting a threshold value, one can use the z-scores of data points to define outliers.

Z-score = (data_point - mean) / std_deviation


4 - Detection and handling of outliers:
Handling outliers with z score example
 Example of calculating the z-score for the column “age”.
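A sketch of that calculation, assuming df_diabetics as above (the diabetes dataset also has an 'age' column):

from sklearn.datasets import load_diabetes

df_diabetics = load_diabetes(as_frame=True).frame  # assumed source of df_diabetics

# Z-score = (value - mean) / standard deviation
z_age = (df_diabetics["age"] - df_diabetics["age"].mean()) / df_diabetics["age"].std()
print(z_age.head())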
4 - Detection and handling of outliers:
Outliers removal using z-score
 To define outliers, a threshold value is
chosen which is generally 2 or 3.
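A sketch of the removal step, with 3 as the (arbitrary but common) threshold and df_diabetics assumed as above:

from sklearn.datasets import load_diabetes

df_diabetics = load_diabetes(as_frame=True).frame  # assumed source of df_diabetics

z_age = (df_diabetics["age"] - df_diabetics["age"].mean()) / df_diabetics["age"].std()

threshold = 3
df_filtered = df_diabetics[z_age.abs() <= threshold]  # keep only rows with |z| <= 3
print(len(df_diabetics), len(df_filtered))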
4 - Detection and handling of outliers:
Handling outliers using IQR
 IQR (Interquartile Range): the interquartile range approach to finding outliers is among the most commonly used and trusted approaches in research:

IQR = Quartile3 – Quartile1


4 - Detection and handling of outliers:
Handling outliers using IQR
 Let’s take the example of calculating the interquartile range (IQR) for the ‘bmi’ column in the DataFrame df_diabetics.

 We first compute the first quartile (Q1) and third quartile (Q3) using the midpoint method, then calculate the IQR as the difference between Q3 and Q1, providing a measure of the spread of the middle 50% of the data in the ‘bmi’ column.
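A sketch of that computation, assuming df_diabetics as above; the 1.5 × IQR rule added at the end is a common (but not the only) way to turn the IQR into outlier bounds:

from sklearn.datasets import load_diabetes

df_diabetics = load_diabetes(as_frame=True).frame  # assumed source of df_diabetics

# Quartiles of 'bmi' using the midpoint method
q1 = df_diabetics["bmi"].quantile(0.25, interpolation="midpoint")
q3 = df_diabetics["bmi"].quantile(0.75, interpolation="midpoint")
iqr = q3 - q1
print(iqr)

# Flag points lying more than 1.5 * IQR outside the quartiles
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df_diabetics[(df_diabetics["bmi"] < lower) | (df_diabetics["bmi"] > upper)]
print(len(outliers))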
