The document presents an overview of Exploratory Data Analysis (EDA), discussing data sourcing from public and private sources, and the importance of data cleaning. It covers identifying and handling irregularities, data types, and methods for standardizing and filtering data. Additionally, it highlights univariate and bivariate analysis, emphasizing the distinction between correlation and causation.

Uploaded by

elsa.ebby

EXPLORATORY DATA ANALYSIS

Apurva Kulkarni, IIIT-Bangalore


SESSION OVERVIEW
Discussing the concepts with illustrative examples
Concepts and real world applications
Data Source : data science process diagram - Google Search
EDA - EXPLORATORY DATA ANALYSIS
EXPLORING AND UNDERSTANDING DATA FOR ANALYSIS

1. DATA   2. UNDERSTAND   3. ANALYZE


1. DATA
What data? From where? How?
Data Sourcing:
Public Data Sources
Private Data Sources
PRIVATE DATA SOURCES
1. Authority to access the data: NDA (Non-Disclosure Agreement)
2. Understanding the compliance requirements (adhering to rules, regulations, laws, or standards)
3. Building a compatible interface to use the data
PUBLIC DATA SOURCES
1. Usually accessed through the web (open data)
2. Understanding the issues: quality, completeness, biases, timeliness, etc.
3. Data heterogeneity
4. Data accessibility: download or subscribe
DATA-CLEANING
1. Identifying irregularities
2. Handling irregularities
IDENTIFYING IRREGULARITIES
• Typos
• Missing value
• Error
• Duplicates

Employee Name   Age   Contact Number   Department   Gender
Ms FName        21    1234567          Production   M
SName           451   9867458          Sales        F
FName
FName           1     1234567                       Female
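
The four kinds of irregularity listed above can be checked for with a few pandas calls. The sketch below is illustrative only: the DataFrame is a hypothetical reconstruction loosely based on the slide's employee table, the column names are invented, and the plausible age range of 18 to 70 is an assumption.

```python
import pandas as pd

# Hypothetical data loosely based on the slide's employee table.
df = pd.DataFrame({
    "name": ["Ms FName", "SName", "FName"],
    "age": [21, 451, 1],
    "contact": ["1234567", "9867458", "1234567"],
    "department": ["Production", "Sales", None],
    "gender": ["M", "F", "Female"],
})

# Missing values: count empty cells per column.
missing_per_column = df.isna().sum()

# Errors: ages outside an assumed plausible range of 18 to 70.
implausible_age = df[~df["age"].between(18, 70)]

# Duplicates: the same contact number appearing on more than one row.
duplicate_contact = df[df.duplicated(subset="contact", keep=False)]

# Typos / inconsistent labels: list the distinct spellings used for gender.
gender_labels = sorted(df["gender"].unique())

print(missing_per_column)
print(implausible_age)
print(duplicate_contact)
print(gender_labels)
```

Each check only flags candidate rows; deciding whether a flagged value is really wrong still needs domain knowledge.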
HANDLING IRREGULARITIES
Data Type
Fix Rows and Columns
Imputing Values
Outliers
Standardize
Filter Data
DATA TYPE
1. Identifying Datatype
2. Understand the type: numeric (discrete or continuous), categorical (ordinal type), time and date, coordinates

Sensor 2   Date        Age    Contact Number   Gender
33.897     2/1/2023    '21'   1234567.0        M
33.896     8/10/2020   '45'   9867458.0        F
0.0        8/10/2020   '31'   1234567.0        Female
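
One possible way to repair the data types in a table like the one above is sketched here with pandas. The column names are hypothetical, and reading the dates as day/month/year is an assumption, since the slide does not state the date order.

```python
import pandas as pd

# Hypothetical reconstruction of the slide's table with the wrong dtypes.
df = pd.DataFrame({
    "sensor": [33.897, 33.896, 0.0],
    "date": ["2/1/2023", "8/10/2020", "8/10/2020"],
    "age": ["21", "45", "31"],                   # numbers stored as text
    "contact": [1234567.0, 9867458.0, 1234567.0],  # identifiers stored as floats
    "gender": ["M", "F", "Female"],
})

# Age is a discrete numeric value, so convert the text to integers.
df["age"] = pd.to_numeric(df["age"]).astype("int64")

# Contact numbers are identifiers, not quantities: store them as strings.
df["contact"] = df["contact"].astype("int64").astype(str)

# Parse the dates (assumption: day/month/year order).
df["date"] = pd.to_datetime(df["date"], format="%d/%m/%Y")

# Gender is categorical.
df["gender"] = df["gender"].astype("category")

print(df.dtypes)
```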
FIX ROWS AND COLUMNS
• Understand header, footer, column names
• Add column name
• Rename abbreviations or code
• Delete: irregular or unidentified columns/rows
• Split: merged cells (URL, Address)
• Merge: split cells (Name)
• Align: misplaced data

Raw table (header "Location / production", misaligned district rows, footer row "Average 17.6") and the same table after fixing:

District            production
Gadag               13
Bengaluru (Rural)   20
Belgavi             20

Location   Sales
BOM        13
BLR        20
HYD        20
PNQ        17

Suffix   First Name   Last Name
Ms       A            B
Mr       C            C

Address
-------, Mumbai, 400097
-------, Bangalore, 560068
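
The row and column fixes above might look like this in pandas. This is a sketch under assumptions: the DataFrames are hypothetical reconstructions of the slide's small tables, the location codes are assumed to be airport-style city codes, and the output column names (`street`, `city`, `pin`, `Name`) are invented for illustration.

```python
import pandas as pd

# Rename a column and delete the footer row.
raw = pd.DataFrame({
    "Location": ["Gadag", "Bengaluru (Rural)", "Belgavi", "Average"],
    "production": [13, 20, 20, 17.6],
})
production = (
    raw.rename(columns={"Location": "District"})
       .loc[lambda d: d["District"] != "Average"]
       .reset_index(drop=True)
)

# Rename abbreviations (assumption: the codes are airport-style city codes).
sales = pd.DataFrame({"Location": ["BOM", "BLR", "HYD", "PNQ"],
                      "Sales": [13, 20, 20, 17]})
city_names = {"BOM": "Mumbai", "BLR": "Bengaluru", "HYD": "Hyderabad", "PNQ": "Pune"}
sales["Location"] = sales["Location"].replace(city_names)

# Merge split cells into a single name column.
names = pd.DataFrame({"Suffix": ["Ms", "Mr"],
                      "First Name": ["A", "C"],
                      "Last Name": ["B", "C"]})
names["Name"] = names["Suffix"] + " " + names["First Name"] + " " + names["Last Name"]

# Split a merged address cell into separate columns.
addresses = pd.Series(["-------, Mumbai, 400097", "-------, Bangalore, 560068"])
parts = addresses.str.split(", ", expand=True)
parts.columns = ["street", "city", "pin"]

print(production, sales, names, parts, sep="\n\n")
```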
IMPUTING VALUES
• Issues: blank, NA, XX, 999, etc.
• Approaches: constant, average, function, external sources (other columns), fill partial data (e.g. 70 becomes 1970)

Age   Salary   Contact Number   Department   Joining year   dependents
35    78000    1234567          d1           99             1
      50000    9867458          d2           01             2
56                              d2           20             0
28    53000    1234567          d2           48             --
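
The imputation approaches listed above can be sketched as follows. The data is a hypothetical reconstruction of the slide's table, and every choice here is an assumption made for illustration: filling dependents with the constant 0, filling age with the column mean, filling salary with the department median, and treating two-digit years above 24 as belonging to the 1900s.

```python
import numpy as np
import pandas as pd

# Hypothetical reconstruction of the slide's table with gaps.
df = pd.DataFrame({
    "age": [35, np.nan, 56, 28],
    "salary": [78000, 50000, np.nan, 53000],
    "department": ["d1", "d2", "d2", "d2"],
    "joining_year": ["99", "01", "20", "48"],
    "dependents": ["1", "2", "0", "--"],
})

# Turn placeholder tokens such as "--" into real missing values.
df["dependents"] = pd.to_numeric(df["dependents"], errors="coerce")

# Constant: fill missing dependents with 0 (assumption).
df["dependents"] = df["dependents"].fillna(0)

# Average: fill missing age with the column mean.
df["age"] = df["age"].fillna(df["age"].mean())

# Other columns: fill missing salary with the median of the same department.
dept_median = df.groupby("department")["salary"].transform("median")
df["salary"] = df["salary"].fillna(dept_median)

# Partial data: expand two-digit years (assumed pivot: above 24 means 19xx).
yy = df["joining_year"].astype(int)
df["joining_year"] = np.where(yy > 24, 1900 + yy, 2000 + yy)

print(df)
```

Which approach is appropriate depends on the column; a mean is sensitive to outliers, so a median or a group-wise value is often a safer default.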
OUTLIERS
• Handling outliers: imputing, deletion, binning, capping

Age Salary dependents

35 78000 1
450 50000 2
56 5100 0
28 53000 10
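
The four ways of handling outliers can be illustrated on the table above. This is a minimal sketch: the valid age range of 18 to 70 and the bin edges for dependents are assumptions, not values given in the slides.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [35, 450, 56, 28],
    "salary": [78000, 50000, 5100, 53000],
    "dependents": [1, 2, 0, 10],
})

# Flag ages outside an assumed valid range of 18 to 70.
is_outlier = ~df["age"].between(18, 70)

# Deletion: drop the rows that contain an outlier.
deleted = df[~is_outlier]

# Imputing: replace the outlier with the median of the valid ages.
imputed = df["age"].mask(is_outlier, df.loc[~is_outlier, "age"].median())

# Capping: pull values back to the nearest allowed bound.
capped = df["age"].clip(lower=18, upper=70)

# Binning: group dependents into bands so an extreme count lands in "3+".
dependents_band = pd.cut(
    df["dependents"], bins=[-1, 0, 2, np.inf], labels=["none", "1-2", "3+"]
)

print(deleted, imputed, capped, dependents_band, sep="\n\n")
```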
STANDARDIZE
• Text: remove extra characters, fix case, use the same format for dates and names
• Numerical: fix units, scale, etc.

Experience   Age   Contact Number   Gender
16           21    0-1234567        male
8            35    +91-9867458      FEMALE
5            31    1234567          Female
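
Standardizing the text and numeric columns of the table above might look like the sketch below. The target formats are assumptions: contact numbers without a `0-` or `+91-` prefix, gender as `M`/`F`, and age rescaled to the range 0 to 1 with min-max scaling.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [21, 35, 31],
    "contact": ["0-1234567", "+91-9867458", "1234567"],
    "gender": ["male", "FEMALE", "Female"],
})

# Text: strip the leading "0-" or "+91-" prefix from contact numbers.
df["contact"] = df["contact"].str.replace(r"^(\+91|0)-", "", regex=True)

# Text: normalize case, then map every spelling to one label.
gender_map = {"male": "M", "m": "M", "female": "F", "f": "F"}
df["gender"] = df["gender"].str.strip().str.lower().map(gender_map)

# Numerical: min-max scaling of age to the range 0 to 1.
age_min, age_max = df["age"].min(), df["age"].max()
df["age_scaled"] = (df["age"] - age_min) / (age_max - age_min)

print(df)
```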
FILTER DATA
• Filter by row, column, or cube; aggregate by granularity

Source : The Rise and Fall of the OLAP Cube (holistics.io)
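
The filtering options above can be sketched on a small, entirely hypothetical sales table (the cities, years, and unit counts are made up for illustration). A pivot table stands in for a two-dimensional slice of a cube.

```python
import pandas as pd

# Hypothetical sales data, invented for this example.
sales = pd.DataFrame({
    "city": ["Mumbai", "Mumbai", "Bengaluru", "Bengaluru"],
    "year": [2022, 2023, 2022, 2023],
    "units": [13, 20, 20, 17],
})

# Filter by row: keep only one year.
rows_2023 = sales[sales["year"] == 2023]

# Filter by column: keep only the columns needed for the analysis.
city_units = sales[["city", "units"]]

# Cube-style view: one dimension per axis, one measure in the cells.
cube = sales.pivot_table(index="city", columns="year", values="units", aggfunc="sum")

# Aggregate by granularity: roll yearly figures up to the city level.
by_city = sales.groupby("city", as_index=False)["units"].sum()

print(rows_2023, city_units, cube, by_city, sep="\n\n")
```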


UNDERSTANDING THE PATTERN
Univariate Analysis
Bivariate Analysis
UNIVARIATE ANALYSIS
1. Categorical
2. Numerical
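
A minimal univariate sketch, assuming hypothetical `department` and `salary` columns: a categorical variable is usually summarized with counts and shares, and a numerical one with summary statistics.

```python
import pandas as pd

# Hypothetical data, invented for this example.
df = pd.DataFrame({
    "department": ["d1", "d2", "d2", "d2"],
    "salary": [78000, 50000, 51500, 53000],
})

# Categorical: frequency and share of each category.
counts = df["department"].value_counts()
shares = df["department"].value_counts(normalize=True)

# Numerical: count, mean, standard deviation, min, quartiles, max.
summary = df["salary"].describe()

print(counts, shares, summary, sep="\n\n")
```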
BIVARIATE ANALYSIS
1. Two Numerical Variables
2. Two Categorical Variables
3. Numeric and Categorical
4. Multivariate
5. Correlation vs Causation
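
The first three bivariate cases can be sketched with pandas on hypothetical data (all column names and values are invented for illustration). A high correlation coefficient here only measures linear association; on its own it says nothing about whether one variable causes the other.

```python
import pandas as pd

# Hypothetical data, invented for this example.
df = pd.DataFrame({
    "age": [25, 35, 45, 55],
    "salary": [40000, 50000, 60000, 70000],
    "department": ["d1", "d1", "d2", "d2"],
    "gender": ["F", "M", "F", "M"],
})

# Two numerical variables: Pearson correlation coefficient.
r = df["age"].corr(df["salary"])

# Two categorical variables: contingency table of counts.
table = pd.crosstab(df["department"], df["gender"])

# Numeric and categorical: compare the numeric variable across groups.
mean_salary = df.groupby("department")["salary"].mean()

print(r, table, mean_salary, sep="\n\n")
```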
CORRELATION VS CAUSATION

Source: correlation example - Google Search
Source: correlation vs causation - Google Search
Data Source : data science process diagram - Google Search
DATASET

Source : Netflix Data: Cleaning, Analysis and Visualization (kaggle.com)
