EDA and Cleaning
Learning Objectives
● Understand what Exploratory Data Analysis (EDA) is
● Identify the goals and purpose of EDA
● Explore a Data Quality Report and why it is useful
● Learn key parts of EDA including checking for missing and duplicate values, creating and leveraging data visualizations, identifying outliers, and interpreting correlation
EDA and Data Cleaning
[Process diagram: Business Understanding → … → Data Cleaning → … → Deployment]
Best Practices
● Goals of Exploratory Data Analysis
○ Understand your data and variables
○ Analyze relationships between variables
● Purpose of Exploratory Data Analysis
○ Get maximum insights
○ Uncover the underlying structure
○ Identify important features
○ Detect any issues or missing values
First Things First: Check for duplicate values!
● Important for accurate results and interpretation
○ Pandas: DataFrame.duplicated()
○ R:
■ duplicated(): identifies duplicated elements
■ unique(): extracts unique elements
■ distinct() [dplyr package]: removes duplicate rows from a data frame
○ SPSS:
■ https://www.ibm.com/support/pages/how-identify-duplicate-cases-ibm-spss-statistics
■ Options for wizard and syntax
● Check for duplicates across all columns and across subsets of columns
● Make sure you have a unique identifier! See pandas.unique()
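A minimal pandas sketch of the duplicate checks above (the data and column names are made up for illustration):

```python
import pandas as pd

# Illustrative data with one fully repeated row and a repeated id
df = pd.DataFrame({
    "id":    [1, 2, 2, 3, 3],
    "score": [10, 20, 20, 30, 35],
})

# Full-row duplicates: every column must match a previous row
full_dupes = df.duplicated()             # Boolean Series; True marks repeats
print(full_dupes.sum())                  # → 1

# Duplicates on a subset of columns (here, the would-be unique identifier)
id_dupes = df.duplicated(subset=["id"])
print(id_dupes.sum())                    # → 2

# Verify the identifier really is unique
print(df["id"].is_unique)                # → False, so "id" is not a valid key
```

Checking both the full rows and the identifier column matters: rows 2 and 4 share an id but differ in score, which only the subset check catches.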
Data Quality Report
● Tabular reports describing each feature in the data: python example
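One way to assemble a simple data quality report in pandas. The columns included vary by project, and the data here is made up, so treat this as a sketch:

```python
import pandas as pd

df = pd.DataFrame({
    "age":    [25, 30, None, 45, 30],
    "income": [50000, 60000, 55000, None, 60000],
})

# One row per feature: non-missing count, missing %, cardinality, summary stats
report = pd.DataFrame({
    "count":       df.count(),
    "missing_pct": df.isnull().mean() * 100,
    "cardinality": df.nunique(),
    "min":         df.min(),
    "mean":        df.mean(),
    "max":         df.max(),
})
print(report)
```

Scanning such a table per feature makes missing values, suspicious ranges, and low/high cardinality easy to spot at a glance.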
Checking for Missing Values
● Critical for accurate understanding and modeling
● Types of missing values
○ Missing Completely at Random (MCAR)
○ Missing at Random (MAR)
○ Missing Not at Random (MNAR)
● Pandas isnull() function
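The isnull() check mentioned above, sketched on a tiny illustrative frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": ["x", "y", None],
})

print(df.isnull())                     # element-wise Boolean mask of missing values
print(df.isnull().sum())               # missing count per column: a → 1, b → 1
print(df.isnull().any(axis=1).sum())   # rows with at least one missing value: → 2
```

Note that isnull() only finds true nulls (NaN/None); sentinel codes like -999 or "unknown" must be found by inspecting distributions, which is why the two checks go together.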
There are certain steps you should ALWAYS DO
● Check the distribution of your DV
● Check the distribution of ALL IVs you will be testing
● When you look at distributions you must analyze with context – this means you need to review the data dictionary documentation
○ How was the data collected?
○ Does the documentation match the distributions you see?
○ Is the data labeled, and do the labels make sense?
○ Are there outliers or weird values?
● Look at the distribution of the DV for EACH IV – this could mean a bar chart, scatter plot, box-and-whisker plot, line graph, etc.
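The steps above can be sketched in pandas; the DV ("churn") and IV ("tenure") names and values here are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "churn":  ["yes", "no", "no", "yes", "no", "no"],  # hypothetical DV
    "tenure": [1, 24, 36, 2, 48, 12],                  # hypothetical IV
})

# Distribution of a categorical DV
print(df["churn"].value_counts(normalize=True))        # "no" → 2/3, "yes" → 1/3

# Distribution of a numeric IV
print(df["tenure"].describe())                         # mean tenure → 20.5

# Distribution of the IV for each level of the DV
print(df.groupby("churn")["tenure"].describe())
```

Even in this toy example the last step surfaces a pattern worth checking against the data dictionary: churners have far lower tenure than non-churners.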
Data Visualization
● Crucial for understanding patterns and trends
○ Univariate – 1 variable
○ Bivariate – 2 variables
○ Multivariate – 3+ variables
● Never skip straight to multivariate analysis! Really look at your distributions and what they tell you about your variables
Histograms
● Commonly used in data science
● Show the distribution of features
● Used in univariate and bivariate analysis
Detecting Outliers
● Outliers
○ Data points that differ significantly from others
○ Can negatively affect analysis
○ Result in lower accuracy in ML training
○ Caused by
■ Measurement or sampling errors
■ Human errors
■ Natural deviations
○ You need to know whether data points are outliers (valid) or errors (invalid)
● Identify outliers with data visualization
○ Box plots and scatter plots
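The 1.5×IQR rule that box-plot whiskers use can also be applied numerically; a sketch on made-up data:

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 11, 95])   # 95 is a suspicious point

# Box-plot whisker rule: flag points beyond 1.5 * IQR from the quartiles
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = s[(s < lower) | (s > upper)]
print(outliers.tolist())                      # → [95]
```

Flagging is only the first step: whether 95 is a valid extreme or an entry error still has to be decided from context.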
Correlation
● Measures the strength and direction of the relationship between variables
● Ranges from -1 to +1
● Positive correlation means that as one variable increases (or decreases), the other also increases (or decreases)
● Negative correlation means that as one variable increases, the other decreases
● Larger absolute correlation values mean stronger relationships
● A correlation of 0 means no linear relationship
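A pandas sketch of both ends of the correlation range, using perfectly linear made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],   # perfectly positively correlated with x
    "z": [5, 4, 3, 2, 1],    # perfectly negatively correlated with x
})

print(df.corr())               # pairwise Pearson correlation matrix
print(df["x"].corr(df["y"]))   # → 1.0
print(df["x"].corr(df["z"]))   # → -1.0
```

Real data rarely hits ±1; also remember Pearson correlation captures only linear relationships, so a scatter plot is still worth a look.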
Questions you should be able to answer right away
● What is your dependent variable and how is it measured?
● What are your key independent variables? How are they measured?
● What do the distributions of your DV and IVs look like?
● Have you run any visuals that would indicate a possible relationship between any of your IVs and the DV?
Group EDA Assignment
● This assignment will be submitted as a group
● Complete exploratory data analysis on a data set for your project
● You must include code and the output of your code in one document in a Blackboard-readable format (Word doc, PDF, etc.)
● Each person in your group must create at least one visual and put their name as a comment in the applicable code section
○ Only group members who actually code a visual with their name in the comment will receive credit
Learning Objectives
● Understand what data cleaning is and its relationship with Exploratory Data Analysis (EDA)
● Learn how to clean duplicate values, erroneous values, and missing values
● Explore options for cleaning outliers
Data Cleaning
● What is Data Cleaning?
○ The process of fixing or removing incorrect, duplicate, or incomplete data
● Why is Data Cleaning important?
○ Without it, outcomes are unreliable
● How is Data Cleaning accomplished?
○ No single solution fits all data
○ Varies from dataset to dataset
Cleaning Duplicate Values
● How to clean duplicate values?
○ Remove them! 99% of the time this is what you want to do
○ Understand why the duplication occurred, if possible
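Removing duplicates in pandas, sketched on made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "id":    [1, 2, 2, 3],
    "score": [10, 20, 20, 30],
})

# Keep the first occurrence of each fully duplicated row
deduped = df.drop_duplicates()
print(len(deduped))    # → 3

# Or deduplicate on a subset of columns, keeping the first row per id
deduped_by_id = df.drop_duplicates(subset=["id"], keep="first")
print(len(deduped_by_id))   # → 3
```

When only a subset of columns defines a duplicate, check what differs in the dropped rows first; that difference is often the clue to why the duplication occurred.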
Data quality reports give insights on erroneous values
● What stands out?
Cleaning Erroneous Values
● Remove rows with incorrect values
● Replace incorrect values
○ Replace with a specific value
○ Replace with the previous/next value
○ Replace with a calculated value
● Try not to alter more than 5% of cases
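A sketch of the remove-vs-replace options, assuming a hypothetical dataset where -1 encodes an impossible age:

```python
import pandas as pd

# Hypothetical data where -1 is an invalid sentinel for age
df = pd.DataFrame({"age": [25, -1, 40, -1, 33]})

# Option 1: remove rows with the incorrect value
dropped = df[df["age"] != -1]
print(len(dropped))                       # → 3

# Option 2: replace with a calculated value (here, the mean of the valid ages)
valid_mean = df.loc[df["age"] != -1, "age"].mean()
df["age_fixed"] = df["age"].replace(-1, valid_mean)
print(df)
```

Here 2 of 5 cases (40%) are affected, well over the 5% guideline, which is itself a signal to investigate how the bad values got into the data rather than silently patching them.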
Cleaning Missing Values (Null)
● Removing missing values
○ dropna()
■ Drop rows with any missing values
■ Drop rows with more than a certain amount missing
● Replacing missing values
○ fillna()
■ Fill missing values with a specific value (e.g., the average)
■ Fill missing values with the previous value
■ Fill missing values with the next value
○ interpolate()
■ Fill missing values with a value calculated from neighboring points
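The dropna/fillna/interpolate options side by side on one made-up series (ffill/bfill are the current pandas spellings of "previous/next value"):

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

print(s.dropna().tolist())          # drop missing → [1.0, 3.0, 5.0]
print(s.fillna(s.mean()).tolist())  # fill with average → [1.0, 3.0, 3.0, 3.0, 5.0]
print(s.ffill().tolist())           # previous value → [1.0, 1.0, 3.0, 3.0, 5.0]
print(s.bfill().tolist())           # next value → [1.0, 3.0, 3.0, 5.0, 5.0]
print(s.interpolate().tolist())     # linear interpolation → [1.0, 2.0, 3.0, 4.0, 5.0]
```

Which option is right depends on the missingness mechanism (MCAR/MAR/MNAR) from the EDA step; for example, ffill makes sense for time series but not for unordered survey rows.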
Fixing Outliers
● Difference between outliers and incorrect values
○ Outliers are valid
● Methods of treating outliers
○ Remove outliers
○ Impute values
○ Use thresholds
○ Normalize the data
○ Use models less affected by outliers
○ Analyze outliers separately
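The "use thresholds" option (often called capping or winsorizing) can be sketched with pandas clip; the percentile cutoffs here are an illustrative choice, not a rule:

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 200])   # 200 is a valid but extreme value

# Threshold / cap: clip extreme values to chosen percentiles
lower, upper = s.quantile(0.05), s.quantile(0.95)
capped = s.clip(lower=lower, upper=upper)
print(capped.tolist())
```

Unlike removal, capping keeps the row (and the rest of its features) in the dataset; it only limits how much one extreme value can pull on means and model fits.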
Fixing Outliers – Do’s and Don’ts
● Consider grouping – if you have outliers among people age 70 and over, for example, grouping ages into bins may handle them without discarding valid data