EDA and Cleaning
Learning Objectives
● Understand what Exploratory Data Analysis (EDA) is
● Identify the goals and purpose of EDA
● Explore a Data Quality Report and why it is useful
● Learn key parts of EDA including checking for missing and duplicate values, creating and leveraging data visualizations, identifying outliers, and interpreting correlation
EDA and Data Cleaning
[Process diagram: Business Understanding → … → Data Cleaning → … → Deployment]
Best Practices
● Goals of Exploratory Data Analysis
○ Understand your data and variables
○ Analyze relationships between variables
● Purpose of Exploratory Data Analysis
○ Get maximum insights
○ Uncover the underlying structure
○ Identify important features
○ Detect any issues or missing values
First Things First: Check for duplicate values!
● Important for accurate results and interpretation
○ Pandas: DataFrame.duplicated()
○ R:
■ duplicated(): identifies duplicated elements
■ unique(): extracts unique elements
■ distinct() [dplyr package]: removes duplicate rows from a data frame
○ SPSS:
■ https://www.ibm.com/support/pages/how-identify-duplicate-cases-ibm-spss-statistics
■ Options for wizard and syntax
● Check for duplicates across all columns and across subsets of columns
● Make sure you have a unique identifier! See pandas.unique()
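A minimal pandas sketch of the duplicate checks above (the data and column names are made up for illustration):

```python
import pandas as pd

# Illustrative data with one fully repeated row and a repeated id
df = pd.DataFrame({
    "id":    [1, 2, 2, 3, 3],
    "score": [10, 20, 20, 30, 35],
})

# Full-row duplicates: every column must match a previous row
full_dupes = df.duplicated()             # Boolean Series; True marks repeats
print(full_dupes.sum())                  # → 1

# Duplicates on a subset of columns (here, the would-be unique identifier)
id_dupes = df.duplicated(subset=["id"])
print(id_dupes.sum())                    # → 2

# Verify the identifier really is unique
print(df["id"].is_unique)                # → False, so "id" is not a valid key
```

Checking both the full rows and the identifier column matters: rows 2 and 4 share an id but differ in score, which only the subset check catches.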
Data Quality Report
● Tabular reports describing each feature in the data: python example
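One way to assemble a simple data quality report in pandas. The columns included vary by project, and the data here is made up, so treat this as a sketch:

```python
import pandas as pd

df = pd.DataFrame({
    "age":    [25, 30, None, 45, 30],
    "income": [50000, 60000, 55000, None, 60000],
})

# One row per feature: non-missing count, missing %, cardinality, summary stats
report = pd.DataFrame({
    "count":       df.count(),
    "missing_pct": df.isnull().mean() * 100,
    "cardinality": df.nunique(),
    "min":         df.min(),
    "mean":        df.mean(),
    "max":         df.max(),
})
print(report)
```

Scanning such a table per feature makes missing values, suspicious ranges, and low/high cardinality easy to spot at a glance.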
Checking for Missing Values
● Critical for accurate understanding and modeling
● Types of missing values
○ Missing Completely at Random (MCAR)
○ Missing at Random (MAR)
○ Missing Not at Random (MNAR)
● Pandas isnull() function
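The isnull() check mentioned above, sketched on a tiny illustrative frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": ["x", "y", None],
})

print(df.isnull())                     # element-wise Boolean mask of missing values
print(df.isnull().sum())               # missing count per column: a → 1, b → 1
print(df.isnull().any(axis=1).sum())   # rows with at least one missing value: → 2
```

Note that isnull() only finds true nulls (NaN/None); sentinel codes like -999 or "unknown" must be found by inspecting distributions, which is why the two checks go together.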
There are certain steps you should ALWAYS DO
● Check the distribution of your DV
● Check the distribution of ALL IVs you will be testing
● When you look at distributions you must analyze with context – this means you need to review the data dictionary documentation
○ How was the data collected?
○ Does the documentation match the distributions you see?
○ Is the data labeled, and do the labels make sense?
○ Are there outliers or weird values?
● Look at the distribution of the DV for EACH IV – this could mean a bar chart, scatter plot, box-and-whisker plot, line graph, etc.
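The steps above can be sketched in pandas; the DV ("churn") and IV ("tenure") names and values here are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "churn":  ["yes", "no", "no", "yes", "no", "no"],  # hypothetical DV
    "tenure": [1, 24, 36, 2, 48, 12],                  # hypothetical IV
})

# Distribution of a categorical DV
print(df["churn"].value_counts(normalize=True))        # "no" → 2/3, "yes" → 1/3

# Distribution of a numeric IV
print(df["tenure"].describe())                         # mean tenure → 20.5

# Distribution of the IV for each level of the DV
print(df.groupby("churn")["tenure"].describe())
```

Even in this toy example the last step surfaces a pattern worth checking against the data dictionary: churners have far lower tenure than non-churners.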
Data Visualization
● Crucial for understanding patterns and trends
○ Univariate – 1 variable
○ Bivariate – 2 variables
○ Multivariate – 3+ variables
● Never skip straight to multivariate analysis! Really look at your distributions and what they tell you about your variables
Histograms
● Commonly used in data science
● Show the distribution of features
● Used in univariate and bivariate analysis
Detecting Outliers
● Outliers
○ Data points that differ significantly from others
○ Can negatively affect analysis
○ Result in lower accuracy in ML training
○ Caused by
■ Measurement or sampling errors
■ Human errors
■ Natural deviations
○ You need to know whether data points are outliers (valid) or errors (invalid)
● Identify outliers with data visualization
○ Box plots and scatter plots
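The 1.5×IQR rule that box-plot whiskers use can also be applied numerically; a sketch on made-up data:

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 11, 95])   # 95 is a suspicious point

# Box-plot whisker rule: flag points beyond 1.5 * IQR from the quartiles
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = s[(s < lower) | (s > upper)]
print(outliers.tolist())                      # → [95]
```

Flagging is only the first step: whether 95 is a valid extreme or an entry error still has to be decided from context.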
Correlation
● Measures the strength and direction of the relationship between variables
● Ranges from -1 to +1
● Positive correlation means that as one variable increases (or decreases), the other also increases (or decreases)
● Negative correlation means that as one variable increases, the other decreases
● Larger absolute correlation values mean stronger relationships
● A correlation of 0 means no linear relationship
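A pandas sketch of both ends of the correlation range, using perfectly linear made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],   # perfectly positively correlated with x
    "z": [5, 4, 3, 2, 1],    # perfectly negatively correlated with x
})

print(df.corr())               # pairwise Pearson correlation matrix
print(df["x"].corr(df["y"]))   # → 1.0
print(df["x"].corr(df["z"]))   # → -1.0
```

Real data rarely hits ±1; also remember Pearson correlation captures only linear relationships, so a scatter plot is still worth a look.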
Questions you should be able to answer right away
● What is your dependent variable and how is it measured?
● What are your key independent variables? How are they measured?
● What do the distributions of your DV and IVs look like?
● Have you run any visuals that would indicate a possible relationship between any of your IVs and the DV?
Group EDA Assignment
● This assignment will be submitted as a group
● Complete exploratory data analysis on a data set for your project
● You must include code and the output of your code in one document in a Blackboard-readable format (Word doc, PDF, etc.)
● Each person in your group must create at least one visual and put their name as a comment in the applicable code section
○ Only group members who actually code a visual with their name in the comment will receive credit
Learning Objectives
● Understand what data cleaning is and its relationship with Exploratory Data Analysis (EDA)
● Learn how to clean duplicate values, erroneous values, and missing values
● Explore options for cleaning outliers
Data Cleaning
● What is Data Cleaning?
○ The process of fixing or removing incorrect, duplicate, or incomplete data
● Why is Data Cleaning important?
○ Without it, outcomes are unreliable
● How is Data Cleaning accomplished?
○ No single solution fits all data
○ Varies from dataset to dataset
Cleaning Duplicate Values
● How to clean duplicate values?
○ Remove them! 99% of the time this is what you want to do
○ Understand why the duplication occurred, if possible
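Removing duplicates in pandas, sketched on made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "id":    [1, 2, 2, 3],
    "score": [10, 20, 20, 30],
})

# Keep the first occurrence of each fully duplicated row
deduped = df.drop_duplicates()
print(len(deduped))    # → 3

# Or deduplicate on a subset of columns, keeping the first row per id
deduped_by_id = df.drop_duplicates(subset=["id"], keep="first")
print(len(deduped_by_id))   # → 3
```

When only a subset of columns defines a duplicate, check what differs in the dropped rows first; that difference is often the clue to why the duplication occurred.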
Data quality reports give insights on erroneous values
● What stands out?
Cleaning Erroneous Values
● Remove rows with incorrect values
● Replace incorrect values
○ Replace with a specific value
○ Replace with the previous/next value
○ Replace with a calculated value
● Try not to alter more than 5% of cases
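A sketch of the remove-vs-replace options, assuming a hypothetical dataset where -1 encodes an impossible age:

```python
import pandas as pd

# Hypothetical data where -1 is an invalid sentinel for age
df = pd.DataFrame({"age": [25, -1, 40, -1, 33]})

# Option 1: remove rows with the incorrect value
dropped = df[df["age"] != -1]
print(len(dropped))                       # → 3

# Option 2: replace with a calculated value (here, the mean of the valid ages)
valid_mean = df.loc[df["age"] != -1, "age"].mean()
df["age_fixed"] = df["age"].replace(-1, valid_mean)
print(df)
```

Here 2 of 5 cases (40%) are affected, well over the 5% guideline, which is itself a signal to investigate how the bad values got into the data rather than silently patching them.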
Cleaning Missing Values (Null)
● Removing missing values
○ dropna()
■ Drop rows with any missing values
■ Drop rows with more than a certain amount missing
● Replacing missing values
○ fillna()
■ Fill missing values with a specific value (e.g., the average)
■ Fill missing values with the previous value
■ Fill missing values with the next value
○ interpolate()
■ Fill missing values with a value calculated from neighboring points
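The dropna/fillna/interpolate options side by side on one made-up series (ffill/bfill are the current pandas spellings of "previous/next value"):

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

print(s.dropna().tolist())          # drop missing → [1.0, 3.0, 5.0]
print(s.fillna(s.mean()).tolist())  # fill with average → [1.0, 3.0, 3.0, 3.0, 5.0]
print(s.ffill().tolist())           # previous value → [1.0, 1.0, 3.0, 3.0, 5.0]
print(s.bfill().tolist())           # next value → [1.0, 3.0, 3.0, 5.0, 5.0]
print(s.interpolate().tolist())     # linear interpolation → [1.0, 2.0, 3.0, 4.0, 5.0]
```

Which option is right depends on the missingness mechanism (MCAR/MAR/MNAR) from the EDA step; for example, ffill makes sense for time series but not for unordered survey rows.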
Fixing Outliers
● Difference between outliers and incorrect values
○ Outliers are valid
● Methods of treating outliers
○ Remove outliers
○ Impute values
○ Use thresholds
○ Normalize the data
○ Use models less affected by outliers
○ Analyze outliers separately
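The "use thresholds" option (often called capping or winsorizing) can be sketched with pandas clip; the percentile cutoffs here are an illustrative choice, not a rule:

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 200])   # 200 is a valid but extreme value

# Threshold / cap: clip extreme values to chosen percentiles
lower, upper = s.quantile(0.05), s.quantile(0.95)
capped = s.clip(lower=lower, upper=upper)
print(capped.tolist())
```

Unlike removal, capping keeps the row (and the rest of its features) in the dataset; it only limits how much one extreme value can pull on means and model fits.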
Fixing Outliers – Do’s and Don’ts
● Consider grouping – if you have outliers among people age 70 and over, for example, grouping ages into bins may handle them without discarding valid data