
Exploratory Data Analysis

1
Learning Objectives
● Understand what Exploratory Data Analysis (EDA) is
● Identify the goals and purpose of EDA
● Explore a Data Quality Report and why it is useful
● Learn key parts of EDA including checking for missing and duplicate values, creating and leveraging data visualizations, identifying outliers, and interpreting correlation

2
EDA and Data Cleaning
● The vast majority of your work will be cleaning and exploring data
● Data cleaning and exploration go hand in hand
● This is where you will spend 80% of your time as a data scientist

[Diagram: data science process cycle – Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment – with Exploratory Data Analysis tied to Data Understanding and Data Cleaning tied to Data Preparation]

3
Best Practices
● Goals of Exploratory Data Analysis
○ Understand your data and variables
○ Analyze relationships between variables
● Purpose of Exploratory Data Analysis
○ Get maximum insights
○ Uncover the underlying structure
○ Identify important features
○ Detect any issues or missing values

4
First Things First: Check for duplicate values!
● Important for accurate results and interpretation
○ Pandas: DataFrame.duplicated()
○ R
● duplicated(): for identifying duplicated elements
● unique(): for extracting unique elements
● distinct() [dplyr package]: to remove duplicate rows in a data frame
○ SPSS:
● https://www.ibm.com/support/pages/how-identify-duplicate-cases-ibm-spss-statistics
● Options for both a wizard and syntax
● Check for duplicates across all columns and across subsets of columns (see the pandas sketch below)
● Make sure you have a unique identifier! See pandas.unique()
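A minimal pandas sketch of these checks, assuming a DataFrame df with a hypothetical customer_id identifier column (names and values are illustrative, not from the slides):

import pandas as pd

# Illustrative data; in practice df would be your loaded dataset
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "purchase":    [10.0, 25.5, 25.5, 7.2],
})

# Flag rows that are exact duplicates across all columns
print(df.duplicated())

# Flag duplicates on a subset of columns (here, the identifier only)
print(df.duplicated(subset=["customer_id"]))

# Confirm the identifier is actually unique
print(df["customer_id"].is_unique)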

5
Data Quality Report
● Tabular reports describing each feature in the data: python example
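The slide's Python example is an image and is not reproduced here; a minimal sketch of building a tabular data quality report with pandas, assuming a DataFrame df with hypothetical columns:

import pandas as pd

df = pd.DataFrame({
    "age":    [34, 45, None, 29, 45],
    "income": [52000, 61000, 48000, None, 61000],
})

# One row per feature: type, count, missing %, cardinality, and basic statistics
report = pd.DataFrame({
    "dtype":       df.dtypes.astype(str),
    "count":       df.count(),
    "missing_%":   df.isnull().mean() * 100,
    "cardinality": df.nunique(),
    "min":         df.min(numeric_only=True),
    "mean":        df.mean(numeric_only=True),
    "max":         df.max(numeric_only=True),
})
print(report)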

6
Data Quality Report
● Tabular reports describing each feature in the data

7
Checking for missing values
● Critical for accurate understanding and modeling
● Types of missing values
○ Missing Completely at Random (MCAR)
○ Missing at Random (MAR)
○ Missing Not At Random (MNAR)
● Pandas isnull() function (see the sketch below)
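A minimal sketch of the isnull() check, assuming a DataFrame df with hypothetical columns:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [34, np.nan, 29],
    "income": [52000, 61000, np.nan],
})

# Count and percentage of missing values per column
print(df.isnull().sum())
print(df.isnull().mean() * 100)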

8
There are certain steps you should ALWAYS DO
● Check the distribution of your DV (dependent variable)
● Check the distribution of ALL IVs (independent variables) you will be testing
● When you look at distributions you must analyze with context – this means you need to review the data dictionary documentation
○ How was the data collected?
○ Does the documentation match the distributions you see?
○ Is the data labeled, and do the labels make sense?
○ Are there outliers or weird values?
● Look at the distribution of the DV for EACH IV – this could mean a bar chart, scatter plot, box and whisker, line graph, etc. (a pandas sketch follows below)
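A minimal sketch of checking the DV on its own and per level of a categorical IV, assuming a hypothetical DataFrame with a numeric DV "price" and a categorical IV "region" (not from the slides); the same idea extends to scatter plots for numeric IVs:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "price":  [200, 340, 150, 410, 275, 390],
    "region": ["north", "south", "north", "south", "north", "south"],
})

# Distribution of the DV on its own
df["price"].hist()
plt.title("Distribution of price (DV)")
plt.show()

# Distribution of the DV for each level of a categorical IV
df.boxplot(column="price", by="region")
plt.show()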

9
Data Visualization
● Crucial for understanding patterns and trends
○ Univariate – 1 variable
○ Bivariate – 2 variables
○ Multivariate – 3+ variables
● Never skip straight to multivariate analysis! Really look at your distributions and what they tell you about your variables

10
Histograms
● Commonly used in data science
● Show the distribution of features
● Used in univariate and bivariate analysis
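A minimal sketch of univariate and bivariate histograms with pandas/matplotlib (column names and values are hypothetical):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "income": [42, 55, 38, 61, 47, 72, 50, 45],
    "group":  ["a", "a", "a", "a", "b", "b", "b", "b"],
})

# Univariate: distribution of a single feature
df["income"].hist(bins=5)
plt.show()

# Bivariate: overlay the distribution for each group
for name, sub in df.groupby("group"):
    sub["income"].hist(bins=5, alpha=0.5, label=name)
plt.legend()
plt.show()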

11
Detecting outliers
● Outliers
○ Data points that differ significantly from others
○ Can negatively affect analysis
○ Result in lower accuracy in ML training
○ Caused by
■ Measurement or sampling errors
■ Human errors
■ Natural deviations
○ You need to know if data points are outliers (valid) or errors (invalid)
● Identify outliers with data visualization
○ Box Plots and Scatterplots
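A minimal sketch of spotting outliers with a box plot and the common 1.5×IQR rule (the column name and values are illustrative):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"income": [48, 52, 50, 47, 51, 49, 180]})  # 180 looks suspicious

# Visual check
df.boxplot(column="income")
plt.show()

# 1.5 * IQR rule
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)]
print(outliers)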

12
Correlation
● Measures the strength and direction of the relationship between variables
● Ranges from -1 to +1
● Positive correlation means that as one variable increases (or decreases), the other also increases (or decreases)
● Negative correlation means that as one variable increases, the other decreases
● Larger absolute correlation values mean stronger relationships
● A correlation of 0 means no linear relationship
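A minimal sketch of a correlation check in pandas (columns and values are hypothetical):

import pandas as pd

df = pd.DataFrame({
    "hours_studied": [2, 4, 6, 8, 10],
    "exam_score":    [55, 62, 71, 80, 88],
    "tv_hours":      [30, 25, 18, 12, 6],
})

# Pairwise Pearson correlations, each ranging from -1 to +1
print(df.corr())

# Correlation of each feature with a single variable of interest
print(df.corr()["exam_score"].sort_values(ascending=False))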

13
Questions you should be able to answer right away
● What is your dependent variable and how is it measured?
● What are your key independent variables? How are they measured?
● What does the distribution of your DV and IVs look like?
● Have you run any visuals that would indicate a possible relationship between any of your IVs and the DV?

14
Group EDA Assignment
● This assignment will be submitted as a group
● Complete exploratory data analysis on a data set for your project
● You must include your code and its output in one document in a Blackboard-readable format (Word doc, PDF, etc.)
● Each person in your group must create at least one visual and put their name as a comment in the applicable code section
○ Only group members who actually code a visual with their name in the comment will receive credit

15
Learning Objectives
● Understand what data cleaning is and its relationship with Exploratory Data Analysis (EDA)
● Learn how to clean duplicate values, erroneous values, and missing values
● Explore options for cleaning outliers

16
Data Cleaning
● What is Data Cleaning?
○ The process of fixing/removing incorrect, duplicate, or incomplete data
● Why is Data Cleaning important?
○ Without it, outcomes are unreliable
● How is Data Cleaning accomplished?
○ No single solution fits all data
○ It varies from dataset to dataset

17
CLEANING DUPLICATE VALUES
● How to clean duplicate values?

18
CLEANING DUPLICATE VALUES
● How to clean duplicate values?
○ Remove them! 99% of the time this is what you want to do (see the sketch below)
○ Understand why the duplication occurred, if possible
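A minimal sketch with pandas drop_duplicates(), assuming a DataFrame df with a hypothetical customer_id key column:

import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "purchase":    [10.0, 25.5, 25.5, 7.2],
})

print(df.duplicated().sum())   # how many exact duplicate rows exist

df = df.drop_duplicates()      # keep the first occurrence, drop the rest

# Or deduplicate on a key column only
df = df.drop_duplicates(subset=["customer_id"], keep="first")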

19
Data quality reports give insights on erroneous values
● What stands out?

20
Cleaning erroneous values
● Remove rows with incorrect values
● Replace incorrect values
○ Replace with a specific value
○ Replace with the previous/next value
○ Replace with a calculated value
● Try not to alter more than 5% of cases (a sketch of these options follows below)
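A minimal sketch of these options, assuming a hypothetical "age" column in which -1 marks an erroneous entry:

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [34, -1, 29, -1, 41]})   # -1 marks an erroneous entry

# Option 1: remove rows with the incorrect value
dropped = df[df["age"] != -1]

# Option 2: replace with a specific value (here, the mean of the valid entries)
valid_mean = df.loc[df["age"] != -1, "age"].mean()
replaced = df["age"].replace(-1, valid_mean)

# Option 3: replace with the previous value (mark as missing, then forward fill)
filled = df["age"].replace(-1, np.nan).ffill()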

21
Cleaning missing values (null)
● Removing missing values
○ dropna()
● Drop rows with missing values
● Drop rows with more than a certain amount missing
● Replacing missing values
○ fillna()
● Fill missing values with a specific value (e.g., the average)
● Fill missing values with the previous value (586.0)
● Fill missing values with the next value (691.0)
○ interpolate()
● Fill missing values with a calculated (interpolated) value
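A minimal pandas sketch of these options, using an illustrative "sales" column (the 586.0 and 691.0 values echo the slide's example):

import numpy as np
import pandas as pd

df = pd.DataFrame({"sales": [586.0, np.nan, 691.0, 702.0, np.nan]})

dropped      = df.dropna()                              # drop rows with any missing value
thresholded  = df.dropna(thresh=1)                      # keep rows with at least 1 non-null value
mean_filled  = df["sales"].fillna(df["sales"].mean())   # fill with a specific value (the average)
ffilled      = df["sales"].ffill()                      # fill with the previous value
bfilled      = df["sales"].bfill()                      # fill with the next value
interpolated = df["sales"].interpolate()                # fill with a calculated (interpolated) value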

22
Fixing Outliers
● Difference between outliers and incorrect values
○ Outliers are valid
● Methods of treating outliers
○ Remove outliers
○ Impute values
○ Use thresholds
○ Normalize the data
○ Use models less affected by outliers
○ Analyze outliers separately
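A minimal sketch of two of these options, thresholding (clipping) and imputing, using hypothetical values; the percentile cutoffs are illustrative choices, not prescribed by the slides:

import pandas as pd

df = pd.DataFrame({"income": [48, 52, 50, 47, 51, 49, 180]})

# Use thresholds: clip values into the 5th-95th percentile range
low, high = df["income"].quantile([0.05, 0.95])
df["income_clipped"] = df["income"].clip(lower=low, upper=high)

# Impute: replace 1.5*IQR outliers with the median of the non-outlier values
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)
df.loc[is_outlier, "income"] = df.loc[~is_outlier, "income"].median()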

23
Fixing Outliers – Do’s and Don’ts
● Consider grouping – if you have outliers for people age 70 and above, for example, binning them into a single age group may work better than removing them

24
