1.3.1. Exploratory Data Analysis

Uploaded by

havietthang02

Exploratory Data Analysis

1
Learning Goals
In this section, we will cover:
- Approaches to conducting Exploratory Data Analysis (EDA)
- EDA techniques
- Sampling from DataFrames
- Producing EDA visualizations

2
What is Exploratory Data Analysis?
Exploratory data analysis (EDA) is an approach for analyzing
data sets to summarize their main characteristics, often with visual
methods.

3
Why is EDA Useful?
- EDA allows us to get an initial feel for the data.
- This lets us determine if the data makes sense, or if further cleaning or more data is needed.
- EDA helps to identify patterns and trends in the data (these can be just as important as findings from modeling).

4
Techniques for EDA

Summary Statistics:
Average, Median, Min, Max, Correlations, etc.
Visualizations:
Histograms, Scatter Plots, Box Plots, etc.
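
As a minimal sketch of the summary statistics listed above, assuming a small made-up DataFrame (the `height` and `weight` columns are illustrative and not from the slides), pandas exposes each statistic as a method:

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "height": [150, 160, 170, 180, 190],
    "weight": [50, 58, 69, 80, 93],
})

avg_height = df["height"].mean()       # average
median_height = df["height"].median()  # median
min_height = df["height"].min()        # minimum
max_height = df["height"].max()        # maximum
correlations = df.corr()               # pairwise Pearson correlations

print(df.describe())  # count, mean, std, min, quartiles, max in one table
print(correlations)
```

`describe()` is often a convenient first call because it reports several of these statistics at once for every numeric column.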

5
Tools for EDA
Data Wrangling:
Pandas

Visualization:
Matplotlib, Seaborn

6
EDA: Job Applicant Summary Statistics

Suppose we want to examine characteristics of job applicants:

- Average: we could look at the average of all interview scores (perhaps by city or job function).
- Max: we could look at the most common words applicants use in application materials.
- Correlations: we could look at the correlations between technical assessments and years of experience (perhaps by type of experience).
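
The three ideas above could be sketched as follows. The `applicants` table, its column names, and the word list are all hypothetical, invented here only to make the example runnable:

```python
import pandas as pd

# Hypothetical applicant data; none of these values come from the slides.
applicants = pd.DataFrame({
    "city": ["Austin", "Austin", "Boston", "Boston", "Denver", "Denver"],
    "interview_score": [7.0, 9.0, 6.0, 8.0, 5.0, 9.0],
    "technical_score": [60, 85, 55, 80, 50, 90],
    "years_experience": [1, 6, 2, 5, 1, 8],
})

# Average interview score by city
avg_by_city = applicants.groupby("city")["interview_score"].mean()

# Most common word in (hypothetical) application materials
words = pd.Series("python sql python teamwork sql python".split())
most_common_word = words.value_counts().idxmax()

# Correlation between technical assessment and years of experience
corr = applicants["technical_score"].corr(applicants["years_experience"])

print(avg_by_city)
print(most_common_word, corr)
```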
7
Sampling from DataFrames
There are many reasons to consider random samples from DataFrames:

- For large data, a random sample can make computation easier.
- We may want to train models on a random sample of the data.
- We may want to over- or under-sample observations when outcomes are uneven.

8
Sampling from DataFrames
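
A minimal sketch of sampling with `DataFrame.sample`, assuming a made-up DataFrame with an imbalanced `label` column (the data and names are illustrative, not from the slides):

```python
import pandas as pd

# Hypothetical data: 90 rows of class 0 and 10 rows of class 1.
df = pd.DataFrame({"x": range(100), "label": [0] * 90 + [1] * 10})

# Fixed number of rows, without replacement
sample_n = df.sample(n=10, random_state=42)

# Fixed fraction of rows
sample_frac = df.sample(frac=0.2, random_state=42)

# Sampling with replacement (rows may repeat)
boot = df.sample(n=100, replace=True, random_state=42)

# Under-sample the majority class so both classes have equal counts
minority = df[df["label"] == 1]
majority = df[df["label"] == 0].sample(n=len(minority), random_state=42)
balanced = pd.concat([minority, majority])

print(balanced["label"].value_counts())
```

Setting `random_state` makes the sample reproducible, which helps when comparing results across runs.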

9
Visualization Libraries

Visualizations can be created in multiple ways:

- Matplotlib
- Pandas (via Matplotlib)
- Seaborn
  - Statistically-focused plotting methods
  - Global style preferences, applied through Matplotlib

10
Basic Scatter Plots with Matplotlib
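
A minimal sketch of a basic scatter plot, assuming synthetic data (the variables `x` and `y` are illustrative and not from the slides):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data: y depends roughly linearly on x.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2 * x + rng.normal(scale=0.5, size=50)

fig, ax = plt.subplots()
ax.scatter(x, y)            # one marker per (x, y) pair
fig.savefig("scatter.png")  # or plt.show() in an interactive session
```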

11
Scatter Plots with Multiple Layers
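
One way to layer several scatter plots is to call the plotting method repeatedly on the same axes. The sketch below assumes two synthetic groups, labelled "group A" and "group B" purely for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
xa, ya = rng.normal(0, 1, 40), rng.normal(0, 1, 40)
xb, yb = rng.normal(3, 1, 40), rng.normal(3, 1, 40)

fig, ax = plt.subplots()
# Each call adds another layer to the same axes.
ax.scatter(xa, ya, label="group A")
ax.scatter(xb, yb, label="group B", marker="^")
ax.legend()
fig.savefig("scatter_layers.png")
```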

12
Histograms
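
A minimal histogram sketch, assuming 500 synthetic values drawn from a normal distribution (the data is illustrative, not from the slides):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(2)
values = rng.normal(loc=5.0, scale=1.5, size=500)

fig, ax = plt.subplots()
# hist returns the bin counts, the bin edges, and the drawn bar patches.
counts, edges, patches = ax.hist(values, bins=20)
fig.savefig("histogram.png")
```

The number of bins strongly affects how the distribution appears, so it is worth trying a few values of `bins`.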

13
Customizing Plots
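
Typical customizations include figure size, axis labels, and a title. The following is only a sketch under assumed data and label text; none of the specific values come from the slides:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)
y = np.sin(x)

# figsize is (width, height) in inches
fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(x, y, color="tab:red", linewidth=2)

# set() accepts several properties at once
ax.set(xlabel="x", ylabel="sin(x)", title="A customized line plot")
ax.grid(True)
fig.savefig("customized.png")
```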

14
Customizing Plots: by Group
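
A common pattern for plotting by group is to iterate over a pandas `groupby` and draw each group on the same axes. This sketch assumes a made-up DataFrame with a `group` column; the names are illustrative:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "group": np.repeat(["a", "b", "c"], 20),
    "x": rng.normal(size=60),
    "y": rng.normal(size=60),
})

fig, ax = plt.subplots()
# groupby yields (group name, sub-DataFrame) pairs in sorted key order
for name, grp in df.groupby("group"):
    ax.plot(grp["x"], grp["y"], marker="o", linestyle="", label=name)

legend = ax.legend(title="group")
ax.set(xlabel="x", ylabel="y")
fig.savefig("by_group.png")
```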

15
Pair Plots for Features

16
Pair Plots for Features

17
Pair Plots for Features

18
Seaborn Example: Hexbin Plot

19
Seaborn Example: Facet Grid

20
Seaborn Example: Facet Grid

21
Seaborn Example: Facet Grid

22
Summary
● Exploratory Data Analysis
○ EDA is an approach to analyzing data sets that summarizes their main characteristics, often using visual methods. It helps you determine if the data is usable as-is, or if it needs further data cleaning.
○ EDA is also important in the process of identifying patterns, observing trends, and formulating hypotheses.
○ Common techniques for EDA include computing summary statistics and producing visualizations.

23
Learning Recap
In this section, we discussed:
- Approaches to conducting Exploratory Data Analysis (EDA)
- EDA techniques
- Sampling from DataFrames
- Producing EDA visualizations

24
