The document presents an overview of Exploratory Data Analysis (EDA), discussing data sourcing from public and private sources, and the importance of data cleaning. It covers identifying and handling irregularities, data types, and methods for standardizing and filtering data. Additionally, it highlights univariate and bivariate analysis, emphasizing the distinction between correlation and causation.


EXPLORATORY DATA ANALYSIS

Apurva Kulkarni, IIIT-Bangalore

SESSION OVERVIEW
Discussing the concepts with illustrative examples
Concepts and real-world applications
Data Source : data science process diagram - Google Search
EDA: EXPLORATORY DATA ANALYSIS
EXPLORING AND UNDERSTANDING DATA FOR ANALYSIS

1.DATA 2.UNDERSTAND 3.ANALYZE

1.DATA
What data? From where? How?
Data Sourcing:
Public Data Sources
Private Data Sources
PRIVATE DATA SOURCES
1. Authority to access the data (NDA: Non-Disclosure Agreement)
2. Understanding compliance (adhering to rules, regulations, laws, or standards)
3. Building a compatible interface to use the data
PUBLIC DATA SOURCES
1. Usually accessed through the web (open data)
2. Understanding the issues: quality, completeness, biases, timeliness, etc.
3. Data heterogeneity
4. Data accessibility: download or subscribe
DATA-CLEANING
1. Identifying irregularities
2. Handling irregularities
IDENTIFYING IRREGULARITIES
•Typos
•Missing value
•Error
•Duplicates

Employee Name | Age | Contact Number | Department | Gender
Ms FName | 21 | 1234567 | Production | M
SName | 451 | 9867458 | Sales | F
FName | (blank) | (blank) | (blank) | (blank)
FName | 1 | 1234567 | (blank) | Female
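The four kinds of irregularity above can be checked mechanically. The following is a minimal sketch in plain Python, not the presenter's method: the valid age range, the accepted gender labels, and the rule that a repeated contact number signals a duplicate are all assumptions made for illustration.

```python
from collections import Counter

# Rows mirror the slide's sample table; None marks a blank cell.
records = [
    {"name": "Ms FName", "age": 21, "contact": "1234567", "department": "Production", "gender": "M"},
    {"name": "SName", "age": 451, "contact": "9867458", "department": "Sales", "gender": "F"},
    {"name": "FName", "age": None, "contact": None, "department": None, "gender": None},
    {"name": "FName", "age": 1, "contact": "1234567", "department": None, "gender": "Female"},
]

# Assumed validation rules (not taken from the slides).
VALID_GENDERS = {"M", "F"}
AGE_RANGE = range(18, 71)


def find_irregularities(rows):
    """Return a list of (row_index, column, issue) tuples."""
    issues = []
    contact_counts = Counter(r["contact"] for r in rows if r["contact"] is not None)
    for i, row in enumerate(rows):
        # Missing values: any blank cell.
        for column, value in row.items():
            if value is None:
                issues.append((i, column, "missing"))
        # Errors: values outside the assumed plausible range.
        if row["age"] is not None and row["age"] not in AGE_RANGE:
            issues.append((i, "age", "out of range"))
        # Typos / inconsistent labels: anything outside the accepted set.
        if row["gender"] is not None and row["gender"] not in VALID_GENDERS:
            issues.append((i, "gender", "inconsistent label"))
        # Duplicates: the same contact number appearing more than once.
        if row["contact"] is not None and contact_counts[row["contact"]] > 1:
            issues.append((i, "contact", "duplicate"))
    return issues


issues = find_irregularities(records)
for issue in issues:
    print(issue)
```

Reporting issues as (row, column, issue) tuples keeps identification separate from handling, which is the same split the slides make.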
HANDLING IRREGULARITIES
Data Type
Fix Rows and Columns
Imputing Values
Outliers
Standardize
Filter Data
DATA TYPE
1. Identifying Datatype
2. Understand: Numeric (discrete or continuous), Categorical (Ordinal Type), Time and Date, Coordinates

Sensor 2 | Date | Age | Contact Number | Gender
33.897 | 2/1/2023 | '21' | 1234567.0 | M
33.896 | 8/10/2020 | '45' | 9867458.0 | F
0.0 | 8/10/2020 | '31' | 1234567.0 | Female
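In the sample table, ages are stored as quoted text, contact numbers as floats, and dates as strings. A minimal sketch of casting each column to a type that matches its meaning follows. The day/month/year date order is an assumption, since the slide does not say which convention the dates use, and the field names are illustrative.

```python
from datetime import datetime

# Raw values as they appear on the slide: ages are quoted strings,
# contact numbers are floats, and dates are text.
raw_rows = [
    {"sensor_2": 33.897, "date": "2/1/2023", "age": "21", "contact": 1234567.0, "gender": "M"},
    {"sensor_2": 33.896, "date": "8/10/2020", "age": "45", "contact": 9867458.0, "gender": "F"},
    {"sensor_2": 0.0, "date": "8/10/2020", "age": "31", "contact": 1234567.0, "gender": "Female"},
]


def convert_types(row, date_format="%d/%m/%Y"):
    """Cast each field to the type that matches its meaning.

    date_format is an assumption (day first); pass "%m/%d/%Y" for month first.
    """
    return {
        "sensor_2": float(row["sensor_2"]),  # continuous numeric
        "date": datetime.strptime(row["date"], date_format).date(),  # date
        "age": int(row["age"]),  # discrete numeric
        "contact": str(int(row["contact"])),  # identifier, not a quantity
        "gender": row["gender"],  # categorical (still needs standardizing)
    }


typed_rows = [convert_types(r) for r in raw_rows]
for row in typed_rows:
    print(row)
```

The contact number is converted to text on purpose: it identifies a person and is never added or averaged, so a numeric type only invites artifacts such as the trailing `.0`.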
FIX ROWS AND COLUMNS
•Understand header, footer, column names
•Add column name
•Rename abbreviations or codes
•Delete: irregular or unidentified columns/rows
•Split: merged cells (URL, Address)
•Merge: split cells (Name)
•Align: misplaced data

Example (header and footer rows, values aligned under their columns):
District | production
Gadag | 13
Bengaluru (Rural) | 20
Belgavi | 20
Average | 17.6

Example (location codes to rename):
Location | Sales
BOM | 13
BLR | 20
HYD | 20
PNQ | 17

Example (split name cells to merge):
Suffix | First Name | Last Name
Ms | A | B
Mr | C | C

Example (merged address cell to split):
Address
-------, Mumbai, 400097
-------, Bangalore, 560068
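Several of the fixes listed under FIX ROWS AND COLUMNS can be sketched as small helper functions. This is an illustrative sketch only: the helper names, the code-to-city lookup, and the assumption that an address always ends in "city, pincode" are not from the slides.

```python
# Hypothetical lookup for the location codes shown on the slide.
CITY_NAMES = {"BOM": "Mumbai", "BLR": "Bengaluru", "HYD": "Hyderabad", "PNQ": "Pune"}


def drop_footer(rows, footer_labels=("Average", "Total")):
    """Delete summary rows that are not observations."""
    return [row for row in rows if row[0] not in footer_labels]


def split_address(address):
    """Split a merged address cell, assuming it ends in ', city, pincode'."""
    street, city, pincode = [part.strip() for part in address.rsplit(",", 2)]
    return {"street": street, "city": city, "pincode": pincode}


def merge_name(title, first, last):
    """Merge split name cells into one value, skipping blanks."""
    return " ".join(part for part in (title, first, last) if part)


production = [("Gadag", 13), ("Bengaluru (Rural)", 20), ("Belgavi", 20), ("Average", 17.6)]
clean_production = drop_footer(production)

sales = [("BOM", 13), ("BLR", 20), ("HYD", 20), ("PNQ", 17)]
named_sales = [(CITY_NAMES.get(code, code), value) for code, value in sales]

address = split_address("-------, Mumbai, 400097")
full_name = merge_name("Ms", "A", "B")

print(clean_production)
print(named_sales)
print(address)
print(full_name)
```

Dropping the footer before any analysis matters because a row such as "Average" would otherwise be counted as one more district.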
IMPUTING VALUES
•Issues: Blank, NA, XX, 999, etc.
•Approaches: constant, average, function, external sources (other columns), fill partial data (e.g. 70 becomes 1970)

Age | Salary | Contact Number | Department | Joining year | dependents
35 | 78000 | 1234567 | d1 | 99 | 1
(blank) | 50000 | 9867458 | d2 | 01 | 2
56 | (blank) | (blank) | d2 | 20 | 0
28 | 53000 | 1234567 | d2 | 48 | --
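Three of the approaches above (average, constant, and filling partial data) can be sketched on the sample table as follows. The set of missing-value markers follows the slide's list plus "--"; the pivot of 30 used to expand two-digit years is an assumption chosen so that 70 maps to 1970 as in the slide, and it would need to be set from domain knowledge in practice.

```python
from statistics import mean

# Markers treated as "missing": blank, NA, XX, 999 from the slide, plus "--".
MISSING_MARKERS = {None, "", "NA", "XX", "--", 999}

rows = [
    {"age": 35, "salary": 78000, "joining_year": "99", "dependents": 1},
    {"age": None, "salary": 50000, "joining_year": "01", "dependents": 2},
    {"age": 56, "salary": None, "joining_year": "20", "dependents": 0},
    {"age": 28, "salary": 53000, "joining_year": "48", "dependents": "--"},
]


def is_missing(value):
    return value in MISSING_MARKERS


def column_mean(rows, column):
    """Average of the known values in a column."""
    return mean([r[column] for r in rows if not is_missing(r[column])])


def impute(rows, column, fill):
    """Return new rows with missing values in `column` replaced by `fill`."""
    return [{**r, column: fill if is_missing(r[column]) else r[column]} for r in rows]


def expand_year(two_digit, pivot=30):
    """Fill partial data: assumed rule, below the pivot is 20xx, otherwise 19xx."""
    value = int(two_digit)
    return 2000 + value if value < pivot else 1900 + value


filled = impute(rows, "age", column_mean(rows, "age"))          # average
filled = impute(filled, "salary", column_mean(rows, "salary"))  # average
filled = impute(filled, "dependents", 0)                        # constant
filled = [{**r, "joining_year": expand_year(r["joining_year"])} for r in filled]

for row in filled:
    print(row)
```

Whether a mean, a constant, or a value derived from another column is appropriate depends on the column; the sketch only shows the mechanics.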
OUTLIERS
•Handling outliers: imputing, deletion, binning, capping

Age | Salary | dependents
35 | 78000 | 1
450 | 50000 | 2
56 | 5100 | 0
28 | 53000 | 10
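Deletion, capping, and binning can be sketched on the sample table as below. Outliers are flagged here with a simple range check against assumed plausible limits, not with a statistical test; with only four rows, a quartile-based rule would not be reliable. The limits and the bin width are assumptions.

```python
# Assumed plausible limits per column (not from the slides).
LIMITS = {"age": (18, 70), "salary": (10000, 200000), "dependents": (0, 6)}

rows = [
    {"age": 35, "salary": 78000, "dependents": 1},
    {"age": 450, "salary": 50000, "dependents": 2},
    {"age": 56, "salary": 5100, "dependents": 0},
    {"age": 28, "salary": 53000, "dependents": 10},
]


def is_outlier(column, value):
    low, high = LIMITS[column]
    return not (low <= value <= high)


def cap(column, value):
    """Capping: pull a value back to the nearest allowed limit."""
    low, high = LIMITS[column]
    return max(low, min(high, value))


def age_bin(age, width=10):
    """Binning: replace an exact age with its range label, e.g. 35 -> '30-39'."""
    start = (age // width) * width
    return f"{start}-{start + width - 1}"


# Identify: which (row, column) cells fall outside the limits.
flagged = [(i, c) for i, r in enumerate(rows) for c, v in r.items() if is_outlier(c, v)]
# Capping: keep every row, clamp the offending values.
capped = [{c: cap(c, v) for c, v in r.items()} for r in rows]
# Deletion: keep only rows with no flagged cell.
kept = [r for r in rows if not any(is_outlier(c, v) for c, v in r.items())]
# Binning: applied here to the capped ages.
bins = [age_bin(r["age"]) for r in capped]

print(flagged)
print(capped)
print(kept)
print(bins)
```

On this table deletion leaves a single row, which illustrates why capping or imputing is often preferred when data is scarce.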
STANDARDIZE
•Text: extra characters, case, same formats for date and name
•Numerical: fix units, scale, etc.

Experience | Age | Contact Number | Gender
16 | 21 | 0-1234567 | male
8 | 35 | +91-9867458 | FEMALE
5 | 31 | 1234567 | Female
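The text fixes above can be sketched for the Gender and Contact Number columns of the sample table. The mapping to "M"/"F" and the treatment of a leading "0-" or "+91-" as a removable prefix are assumptions for illustration.

```python
import re

# Assumed canonical labels for the gender column.
GENDER_MAP = {"m": "M", "male": "M", "f": "F", "female": "F"}


def standardize_gender(value):
    """Normalize case and spelling; unknown labels are returned unchanged."""
    return GENDER_MAP.get(value.strip().lower(), value)


def standardize_contact(value):
    """Drop an assumed '0-' or '+91-' prefix, then keep digits only."""
    without_prefix = re.sub(r"^(\+91|0)-", "", value.strip())
    return re.sub(r"\D", "", without_prefix)


rows = [
    {"contact": "0-1234567", "gender": "male"},
    {"contact": "+91-9867458", "gender": "FEMALE"},
    {"contact": "1234567", "gender": "Female"},
]

standardized = [
    {"contact": standardize_contact(r["contact"]), "gender": standardize_gender(r["gender"])}
    for r in rows
]
for row in standardized:
    print(row)
```

After standardizing, the first and third rows share the contact number 1234567, a duplicate that was hidden by the differing formats.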
FILTER DATA
•Filter by: row, column, or cube; aggregate by granularity

Source : The Rise and Fall of the OLAP Cube (holistics.io)
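Filtering by row, filtering by column, and aggregating at different granularities can be sketched as follows. The sales records, field names, and helper functions are hypothetical and serve only to show the three operations.

```python
from collections import defaultdict

# Hypothetical records (not from the slides).
sales = [
    {"city": "Mumbai", "month": "2023-01", "product": "A", "units": 13},
    {"city": "Mumbai", "month": "2023-02", "product": "B", "units": 7},
    {"city": "Bengaluru", "month": "2023-01", "product": "A", "units": 20},
    {"city": "Bengaluru", "month": "2024-01", "product": "A", "units": 5},
]


def filter_rows(rows, predicate):
    """Row filter: keep records that satisfy the predicate."""
    return [r for r in rows if predicate(r)]


def select_columns(rows, columns):
    """Column filter: keep only the named fields."""
    return [{c: r[c] for c in columns} for r in rows]


def aggregate(rows, key, value):
    """Sum `value` per group, where `key` sets the granularity."""
    totals = defaultdict(int)
    for r in rows:
        totals[key(r)] += r[value]
    return dict(totals)


mumbai = filter_rows(sales, lambda r: r["city"] == "Mumbai")
city_units = select_columns(sales, ["city", "units"])
by_month = aggregate(sales, lambda r: r["month"], "units")     # finer granularity
by_year = aggregate(sales, lambda r: r["month"][:4], "units")  # coarser granularity

print(mumbai)
print(city_units)
print(by_month)
print(by_year)
```

Changing only the `key` function moves between monthly and yearly totals, which is the "aggregate by granularity" idea in miniature.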

UNDERSTANDING THE PATTERN
Univariate Analysis
Bivariate Analysis
UNIVARIATE ANALYSIS
1. Categorical
2. Numerical
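Univariate analysis looks at one variable at a time: frequency counts for a categorical variable and summary statistics for a numerical one. A minimal sketch with hypothetical values:

```python
from collections import Counter
from statistics import mean, median, stdev

# Hypothetical columns (not from the slides).
departments = ["Production", "Sales", "Sales", "Production", "Sales"]
ages = [21, 35, 31, 28, 56]

# Categorical: how often each category occurs.
category_counts = Counter(departments)

# Numerical: location and spread of the values.
summary = {
    "min": min(ages),
    "max": max(ages),
    "mean": mean(ages),
    "median": median(ages),
    "stdev": stdev(ages),  # sample standard deviation
}

print(category_counts.most_common())
print(summary)
```

Comparing the mean with the median is a quick check for skew: here the single value 56 pulls the mean above the median.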
BIVARIATE ANALYSIS
1. Two Numerical Variables
2. Two Categorical Variables
3. Numeric and Categorical
4. Multivariate
5. Correlation vs Causation
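The first three pairings above each have a common summary: a correlation coefficient for two numerical variables, a cross-tabulation for two categorical variables, and per-group statistics for a numerical and a categorical variable. A minimal sketch on hypothetical employee records, with the Pearson coefficient written out by hand so the block has no dependencies:

```python
from collections import Counter, defaultdict
from math import sqrt
from statistics import mean

# Hypothetical records (not from the slides).
rows = [
    {"department": "Sales", "gender": "F", "experience": 2, "salary": 40000},
    {"department": "Sales", "gender": "M", "experience": 4, "salary": 52000},
    {"department": "Production", "gender": "F", "experience": 6, "salary": 58000},
    {"department": "Production", "gender": "M", "experience": 8, "salary": 70000},
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Two numerical variables: correlation.
r = pearson([row["experience"] for row in rows], [row["salary"] for row in rows])

# Two categorical variables: cross-tabulated counts.
crosstab = Counter((row["department"], row["gender"]) for row in rows)

# Numerical and categorical: mean of the number within each category.
salaries_by_department = defaultdict(list)
for row in rows:
    salaries_by_department[row["department"]].append(row["salary"])
mean_salary = {dept: mean(values) for dept, values in salaries_by_department.items()}

print(round(r, 3))
print(dict(crosstab))
print(mean_salary)
```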
CORRELATION VS CAUSATION

Source: correlation example - Google Search Source : correlation vs causation - Google Search
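A small constructed example can make the distinction concrete. In the hypothetical data below, temperature drives both ice cream sales and sunburn cases; the two are strongly correlated overall, yet the correlation vanishes once temperature is held fixed. The scenario and numbers are invented for illustration and are not taken from the slides' figures.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either variable is constant."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


# Hypothetical daily records: (temperature_c, ice_cream_sales, sunburn_cases).
days = [
    (20, 10, 1), (20, 10, 3), (20, 14, 1), (20, 14, 3),
    (30, 30, 11), (30, 30, 13), (30, 34, 11), (30, 34, 13),
]

# Correlation across all days, ignoring temperature.
overall = pearson([d[1] for d in days], [d[2] for d in days])

# Correlation within each temperature level (the common cause held fixed).
within = {
    t: pearson([d[1] for d in days if d[0] == t], [d[2] for d in days if d[0] == t])
    for t in sorted({d[0] for d in days})
}

print(round(overall, 3))
print(within)
```

A high overall coefficient here says nothing about ice cream causing sunburn; checking the relationship within levels of a suspected common cause is one simple way to probe that.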
Data Source : data science process diagram - Google Search
DATASET

Source : Netflix Data: Cleaning, Analysis and Visualization (kaggle.com)
