BI - Lecture04B - Intro To DataWrangling and EDA

Intro to Data Wrangling and EDA

CS 459 Business Intelligence
Data Wrangling
February 24 CS459 - Business Intelligence - Abeera Tariq 2
• Data wrangling, also called data munging, is the process of gathering, collecting, and
transforming raw data into another format for better understanding, decision-making,
accessing, and analysis in less time.
• All the activity that you do on the raw data to make it “clean” enough to input to your
analytical algorithm is called data wrangling or data munging. (Shubham Simar Tomar, 2016)

1. Discovering
• Getting familiar with the data
• Identify the multiple ways the data can be used for different purposes, like
checking the ingredients before cooking a meal
• Data may have been collected from multiple sources, so formatting is required
to understand the relationships.

2. Structuring

• Data structuring transforms raw data into a structured format for easier
interpretation and analysis.
• Raw data is often of little use to analysts as-is because it can be incomplete
or hard to interpret.
• It needs to be parsed so that analysts can extract the relevant information.
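
As a minimal sketch of structuring, the snippet below parses a small block of semicolon-delimited text into typed records. The sample data, the column names (`order_id`, `customer`, `amount`), and the `structure` helper are all hypothetical and chosen only for illustration.

```python
import csv
import io

# Hypothetical raw export: one record per line, inconsistent spacing.
RAW = """order_id;customer;amount
1001; Alice ;250.00
1002;Bob; 99.5
"""

def structure(raw_text):
    """Parse semicolon-delimited text into a list of typed dictionaries."""
    reader = csv.DictReader(io.StringIO(raw_text), delimiter=";")
    records = []
    for row in reader:
        records.append({
            "order_id": int(row["order_id"].strip()),    # text -> integer
            "customer": row["customer"].strip(),         # trim stray spaces
            "amount": float(row["amount"].strip()),      # text -> number
        })
    return records

records = structure(RAW)
print(records)
```

Once every field has a known name and type, later steps such as cleaning and validating can work on the records directly.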

3. Cleaning
• People often use data cleaning
and data wrangling
interchangeably. However,
data cleaning is one step in
the data wrangling process.
• Clean and resolve issues with
the data.

4. Enriching

• After transforming the data into a usable format, check whether data from other
datasets can make your analysis more effective.
• Enriching helps improve the quality of the data if it does not meet the requirements.
• Example: combine two databases where one contains phone numbers and the other does not.
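
The phone-number example can be sketched as a simple left join. The two tables, the `customer_id` key, and the `enrich` helper below are assumptions made for illustration, not part of the lecture.

```python
# Hypothetical primary table without phone numbers.
customers = [
    {"customer_id": 1, "name": "Alice"},
    {"customer_id": 2, "name": "Bob"},
]

# Hypothetical second source that knows some phone numbers.
phone_book = [
    {"customer_id": 1, "phone": "555-0101"},
]

def enrich(primary, extra, key, field):
    """Left join: copy `field` from `extra` onto matching rows of `primary`.

    Rows with no match keep all their data and get None for `field`.
    """
    lookup = {row[key]: row[field] for row in extra}
    return [{**row, field: lookup.get(row[key])} for row in primary]

enriched = enrich(customers, phone_book, "customer_id", "phone")
print(enriched)
```

A left join keeps every row of the primary table, so enrichment never silently drops customers that the second source does not know about.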
5. Validating
• Check for data
accuracy and
quality.
• Data validation
ensures that data
is fit for analysis.
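
One way to picture validation is as a set of rule checks that report problems instead of fixing them. The rules below (unique `order_id`, non-missing and non-negative `amount`) and the sample rows are hypothetical; real rules depend on the dataset.

```python
def validate(records):
    """Return a list of (row_index, problem) tuples; an empty list means the data passed."""
    problems = []
    seen_ids = set()
    for i, row in enumerate(records):
        # Rule 1: order_id must be unique.
        if row.get("order_id") in seen_ids:
            problems.append((i, "duplicate order_id"))
        seen_ids.add(row.get("order_id"))
        # Rule 2: amount must be present and non-negative.
        amount = row.get("amount")
        if amount is None:
            problems.append((i, "missing amount"))
        elif amount < 0:
            problems.append((i, "negative amount"))
    return problems

sample = [
    {"order_id": 1, "amount": 20.0},
    {"order_id": 1, "amount": 35.0},
    {"order_id": 2, "amount": None},
    {"order_id": 3, "amount": -5.0},
]
issues = validate(sample)
print(issues)
```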

6. Publishing

• Publish the data after validating.

• The data is shared as a report or electronic document, or deposited into a
database, where it can be processed further to create larger and more complex
structures such as data warehouses.
• Once published, data is all set for analysis.

Summarizing the 6 Steps of Data Wrangling
1. Discovering, 2. Structuring, 3. Cleaning, 4. Enriching, 5. Validating, 6. Publishing

Importance of
Data Wrangling
• In data science and data analysis, the amount of work that goes into data
wrangling is often summarized by the 80/20 rule: data scientists are commonly
said to spend about 80% of their time ‘wrangling’ or preparing data and only
about 20% of their time actually analyzing it.

Exploratory Data Analysis (EDA)

In data science, exploratory data analysis involves examining the distribution
of the variables in the dataset, identifying outliers, finding trends and
patterns, and looking for relationships between variables by using heat maps or
correlation metrics.
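
Two of these EDA tasks can be sketched with the standard library alone: measuring the relationship between two variables and flagging outliers. The choice of Pearson correlation and the 1.5 × IQR rule (the usual box plot convention) are assumptions here, since the slide does not name specific methods, and the data is invented.

```python
import math
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def iqr_outliers(values):
    """Values lying more than 1.5 * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical data: sales rise linearly with ad spend.
ad_spend = [10, 20, 30, 40, 50]
sales = [25, 45, 65, 85, 105]
# Hypothetical daily order counts with one suspicious day.
daily_orders = [12, 14, 13, 15, 14, 13, 90]

print(round(pearson(ad_spend, sales), 3))
print(iqr_outliers(daily_orders))
```

A heat map is essentially this correlation computed for every pair of columns and drawn as a colored grid.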

EDA

Data Wrangling

Data Cleaning

Types of dirty data

Duplicate data

Outdated Data

Incomplete data

Missing Values

• Every value in every column has a certain probability of being missing (Rubin, 1976)
• Generally, each column in a dataset follows a probability distribution, which
describes how likely its values are to occur (e.g., bell curve, exponential,
logarithmic).
• Missing Completely at Random (MCAR)
• Missing at Random (MAR)
• Missing Not at Random (MNAR)

Missing Values - MCAR

• Missing Completely at Random (MCAR):
  • Every column value has the same probability of being missing
  • Causes of the missing data are unrelated to the data
  • A product weighing scale produces missing readings because its batteries have died
  • Sales data for an outlet is missing because the outlet was closed for maintenance
  • ATM data is missing over some time period because the ATM was being refilled
    with cash, or because a technical glitch affected multiple locations

Missing Values - MAR

• Missing at Random (MAR):
  • Different column values (e.g., different groups) can have different
    probabilities of being missing; this is the most common case
  • Causes of the missing data are related to the observed data, but not to the
    missing values themselves
  • A weighing scale produces more missing values for product categories known to be heavy
  • Sales data is missing for teenage customers because no promotion targeted teenagers
  • ATM data is missing for a time period because of weekends or holidays, or
    because of lower transaction volumes. The missingness is related to an
    observed variable (day of the week) but not directly to the missing values.

Missing Values - MNAR

• Missing Not at Random (MNAR):
  • The case cannot be categorized as MCAR or MAR: the probability of being
    missing varies for reasons that are unknown to us
  • A weighing scale gives more missing values over time as it wears out, and
    this goes undetected
  • Sales data has more and more missing values over time because customers are
    relocating, and this goes undetected
  • ATM data thins out because fewer and fewer people use the ATM out of fear of theft
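
A rough first check on the missingness mechanism is to compare the missing rate across groups of an observed variable. The sketch below uses an invented ATM log and a hypothetical `missing_rate_by_group` helper. Similar rates across groups are consistent with MCAR, while clearly different rates point toward MAR. This check cannot rule out MNAR, because that depends on values that were never observed.

```python
from collections import defaultdict

def missing_rate_by_group(rows, group_key, value_key):
    """Fraction of rows with a missing (None) value, per group."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        totals[row[group_key]] += 1
        if row[value_key] is None:
            missing[row[group_key]] += 1
    return {group: missing[group] / totals[group] for group in totals}

# Hypothetical ATM log: readings go missing mostly on weekends.
atm_log = [
    {"day": "weekday", "withdrawals": 120},
    {"day": "weekday", "withdrawals": 98},
    {"day": "weekday", "withdrawals": 110},
    {"day": "weekday", "withdrawals": None},
    {"day": "weekend", "withdrawals": None},
    {"day": "weekend", "withdrawals": None},
    {"day": "weekend", "withdrawals": 40},
    {"day": "weekend", "withdrawals": None},
]

rates = missing_rate_by_group(atm_log, "day", "withdrawals")
print(rates)
```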

Incorrect/Inaccurate Data

• If an online store mistakenly records double the number of sales in a certain
month, the reported average customer spend will be inflated.
• While the data might make it seem like the store is performing well, this false
information could lead to poor decision-making.
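
The effect can be worked through with a few invented numbers. The sketch assumes that "double the number of sales" means every sale was logged twice, and that average customer spend is total sales divided by the number of distinct customers. Both are assumptions for illustration.

```python
def average_spend_per_customer(sales):
    """Total sales amount divided by the number of distinct customers.

    `sales` is a list of (customer, amount) pairs.
    """
    customers = {customer for customer, _ in sales}
    return sum(amount for _, amount in sales) / len(customers)

# Hypothetical month: three customers, one purchase each.
true_sales = [("alice", 40.0), ("bob", 60.0), ("carol", 50.0)]
# Recording error: every sale was logged twice.
recorded_sales = true_sales * 2

true_avg = average_spend_per_customer(true_sales)
recorded_avg = average_spend_per_customer(recorded_sales)
print(true_avg, recorded_avg)
```

The customer count stays the same while the total doubles, so the reported average doubles as well even though nothing about the business changed.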

• Incorrect data leads to
incorrect insights
• Will the analysis be
useful?
• A waste of time, energy
and resources.

Inconsistent Data

Types of dirty data

Data Cleaning

Problems with the Data

Interpreting Histograms and Box Plots

What is a Histogram?
A histogram is a graphical representation of the frequency distribution of a
continuous series using rectangles.
The x-axis of the graph represents the class intervals, and the y-axis shows the
frequencies corresponding to the different class intervals.
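
The counting behind a histogram can be sketched in a few lines. The test scores, the interval width of 10, and the `histogram` helper are hypothetical. Each class interval here includes its lower bound and excludes its upper bound, which is one common convention.

```python
def histogram(values, start, width, bins):
    """Count values per class interval [start + i*width, start + (i+1)*width)."""
    counts = [0] * bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < bins:        # ignore values outside the covered range
            counts[index] += 1
    return counts

# Hypothetical test scores.
scores = [52, 57, 61, 64, 68, 71, 73, 75, 78, 82, 85, 91]
counts = histogram(scores, start=50, width=10, bins=5)

# Text rendering: one row per class interval, one '#' per observation.
for i, count in enumerate(counts):
    low = 50 + i * 10
    print(f"{low}-{low + 10}: {'#' * count}")
```

The class intervals play the role of the x-axis and the counts are the rectangle heights on the y-axis.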

Analyzing Histograms: Shape, Skew and Kurtosis

Mean, Median, Mode

• Mean: The "average" number; found by adding all data points and dividing by the
number of data points. It is sensitive to outliers.
• Median: The middle number; found by ordering all data points and picking out the
one in the middle (or, if there are two middle numbers, taking the mean of those
two numbers). It is largely unaffected by outliers.
• Mode: The most frequent number, that is, the number that occurs the highest
number of times.
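
The three measures, and their different sensitivity to outliers, can be checked with Python's `statistics` module. The salary figures below are invented for illustration.

```python
import statistics

# Hypothetical salaries (in thousands), then the same list with one extreme value.
salaries = [30, 32, 32, 35, 38]
with_outlier = salaries + [400]

for label, data in (("original", salaries), ("with outlier", with_outlier)):
    print(
        label,
        "mean:", statistics.mean(data),
        "median:", statistics.median(data),
        "mode:", statistics.mode(data),
    )
```

In this sample a single extreme value moves the mean from 33.4 to 94.5, while the median only moves from 32 to 33.5 and the mode stays at 32.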

Skew

• Skewness is a statistical measure that assesses the asymmetry of a probability
distribution. It quantifies the extent to which the data is skewed or shifted to
one side. Positive skew means a long tail on the right; negative skew means a long
tail on the left.
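
A minimal sketch of one common definition, the moment coefficient of skewness (third central moment divided by the 1.5 power of the second), is shown below. The slide does not say which formula it has in mind, so this choice and the sample data are assumptions.

```python
import statistics

def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population form)."""
    mean = statistics.fmean(values)
    n = len(values)
    m2 = sum((v - mean) ** 2 for v in values) / n   # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n   # third central moment
    return m3 / m2 ** 1.5

right_tailed = [1, 2, 2, 3, 3, 3, 4, 20]     # long tail on the right
left_tailed = [-v for v in right_tailed]     # mirror image: long tail on the left
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_tailed), skewness(left_tailed), skewness(symmetric))
```

The sign of the result matches the direction of the long tail, and a symmetric sample gives zero.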

Kurtosis

• Kurtosis is a statistical measure that quantifies the shape of a probability
distribution. It provides information about the tails and peakedness of the
distribution compared to a normal distribution.
• Positive excess kurtosis (kurtosis above that of a normal distribution) indicates
heavier tails and a more peaked distribution, while negative excess kurtosis
suggests lighter tails and a flatter distribution.
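
Since the comparison is against a normal distribution, whose kurtosis is 3, the sketch below computes excess kurtosis as the fourth central moment divided by the squared second moment, minus 3. This population formula and both sample lists are assumptions for illustration.

```python
import statistics

def excess_kurtosis(values):
    """Population excess kurtosis: m4 / m2**2 - 3 (0 for a normal distribution)."""
    mean = statistics.fmean(values)
    n = len(values)
    m2 = sum((v - mean) ** 2 for v in values) / n   # second central moment
    m4 = sum((v - mean) ** 4 for v in values) / n   # fourth central moment
    return m4 / m2 ** 2 - 3.0

heavy_tails = [-10, -1, 0, 0, 0, 0, 1, 10]   # most mass at the center, two extremes
flat = [1, 2, 3, 4, 5, 6, 7, 8]              # evenly spread, no pronounced tails

print(excess_kurtosis(heavy_tails), excess_kurtosis(flat))
```

The heavy-tailed sample yields a positive value and the flat sample a negative one.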

Example: Scores on a Test

Interpreting Box Plots

Histograms and Box Plots
