Module 4

The document outlines the process of exploratory data analysis, emphasizing the importance of understanding data structures, completeness, and relationships. It discusses various datasets, such as the Boston House Prices and Iris datasets, and highlights the significance of handling missing data through imputation methods. The document also categorizes missing data into three classes: MCAR, MAR, and NMAR, and warns against simply discarding rows with missing values due to potential biases and inconsistencies.

Uploaded by

Pratham Choubey


Learning Objectives

● Exploratory Data Analysis

● Hands-on Code
● Understanding missing data

Introduction

Where are we?

Stages shown in the pipeline figure: Problem Formulation, Data Collection & Processing, Exploratory Data Analysis, Data Modeling, Insight/Prediction, Presentation.

● Raw data collection & pre-processing
● Data collection and preprocessing take up roughly 80% of the time

● Raw data preprocessing tools
● Data query language (SQL) for search and update in a relational DBMS
● Storing semi-structured data in XML, JSON formats

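Introduction: Real world scenario

Collected raw data comes in three kinds:
● Structured: tables, stored in an RDBMS and queried with SQL
● Semi-structured: XML, JSON, stored in a non-relational DBMS (NoSQL)
● Unstructured

After preprocessing, the data moves on to Exploratory Data Analysis. Small public datasets (CSV) are handled with R/Python.
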
Data Exploration
● The goal of data exploration is to learn about the data.
● The data scientist wants to know the basic characteristics of the data, e.g.,
○ the structure,
○ the size,
○ the completeness (or rather where data is missing), and
○ the relationships between different parts of the data.

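The four characteristics above can be checked with a few lines of code. The sketch below is only an illustration: the table, its column names (`rooms`, `crime`, `price`) and its values are hypothetical, and it uses only the Python standard library so that it runs without extra packages.

```python
import math

# Hypothetical toy table: each dict is one row; None marks a missing value.
rows = [
    {"rooms": 6.5, "crime": 0.02, "price": 24.0},
    {"rooms": 6.4, "crime": 0.03, "price": 21.6},
    {"rooms": 7.2, "crime": 0.03, "price": 34.7},
    {"rooms": 5.9, "crime": None, "price": 18.9},
    {"rooms": 6.0, "crime": 0.09, "price": None},
    {"rooms": 5.6, "crime": 0.63, "price": 15.0},
]

# Structure: which columns exist and what type their values have.
columns = list(rows[0].keys())
types = {c: type(next(r[c] for r in rows if r[c] is not None)).__name__
         for c in columns}

# Size: number of rows and columns.
shape = (len(rows), len(columns))

# Completeness: how many values are missing in each column.
missing = {c: sum(r[c] is None for r in rows) for c in columns}

def pearson(xs, ys):
    """Pearson correlation of two equally long lists of numbers."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Relationships: correlate two columns using only rows where both are present.
pairs = [(r["rooms"], r["price"]) for r in rows
         if r["rooms"] is not None and r["price"] is not None]
rooms_price_corr = pearson([p[0] for p in pairs], [p[1] for p in pairs])

print("columns:", types)
print("shape:", shape)
print("missing per column:", missing)
print("corr(rooms, price):", round(rooms_price_corr, 3))
```

Note that the correlation already has to decide what to do with incomplete rows; here they are skipped pair by pair, which is one choice among several.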
Data Exploration
● The exploration is usually a semi-automated interactive process in which data scientists use many different tools to consider different aspects of the data.
● These tools allow the data scientist to inspect raw data or preprocessed data, e.g., comma-separated values (CSV) files.
● In this course we will use tools available in Python:
○ Statistical measures
○ Visualizations

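As a small illustration of the statistical measures mentioned above, the following sketch summarizes one numeric column with Python's built-in `statistics` module. The `prices` list is a hypothetical sample, not data from any of the datasets discussed here.

```python
import statistics

# Hypothetical sample of house prices (in $1000s); not taken from a real dataset.
prices = [15.0, 18.9, 21.6, 22.0, 24.0, 27.5, 34.7, 50.0]

summary = {
    "count": len(prices),
    "min": min(prices),
    "max": max(prices),
    "mean": statistics.mean(prices),
    "median": statistics.median(prices),
    "stdev": statistics.stdev(prices),  # sample standard deviation
}

# Quartiles: cut points that split the sorted data into four equal parts.
q1, q2, q3 = statistics.quantiles(prices, n=4)
summary["q1"], summary["q3"] = q1, q3

for name, value in summary.items():
    print(f"{name:>6}: {value:.2f}")
```

In this sample the mean is larger than the median, which hints at a right-skewed distribution; a histogram or density plot would confirm it.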
Examples

Boston House Prices Dataset
This dataset contains information collected by the U.S. Census Service concerning housing in the area of Boston, Mass. (1978).
We will focus on exploring two of its variables: MEDV (median home value) and CRIM (per-capita crime rate).

Examples: Boston House Prices / MEDV
● Histogram
● Density
● Density + rug
● Density + histogram + rug

Examples: Boston House Prices / CRIM
● Density + histogram + rug

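The three plot types named above are built from simple computations. The sketch below shows what they are, assuming synthetic right-skewed values in place of a real column such as MEDV. A histogram counts values per bin, a density estimate averages Gaussian bumps centred on each observation, and a rug is the raw observations themselves. The bin count and bandwidth are arbitrary choices for illustration, and the output is text rather than a drawn figure.

```python
import math
import random

random.seed(0)
# Synthetic, right-skewed values standing in for a column such as MEDV.
values = [random.lognormvariate(3.0, 0.4) for _ in range(200)]

def histogram(data, bins):
    """Count how many values fall into each of `bins` equal-width intervals."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in data:
        # The maximum lands exactly on the upper edge; put it in the last bin.
        idx = min(int((x - lo) / width), bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return edges, counts

def gaussian_kde(data, bandwidth):
    """Return a function estimating the density as an average of Gaussian bumps."""
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)

    return density

edges, counts = histogram(values, bins=10)
density = gaussian_kde(values, bandwidth=2.0)

# Text rendering: one row per bin with its bar and the density at the bin centre.
for left, right, count in zip(edges, edges[1:], counts):
    centre = (left + right) / 2
    print(f"{left:6.1f}-{right:6.1f} | {'#' * count:<60} | density={density(centre):.4f}")

# A rug is simply the sorted raw observations drawn as ticks along the axis.
rug = sorted(values)
print("first rug ticks:", [round(v, 1) for v in rug[:5]])
```

Plotting libraries combine the same three ingredients in a single figure; the numbers behind the picture are the ones computed here.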
Examples

Iris Dataset
The Iris flower data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper.

The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica, and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters.

Examples: IRIS
● Box plot

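A box plot draws a five-number summary for each group. The sketch below computes those numbers per species, assuming made-up petal lengths in a plausible range; they are not rows from the actual Iris data set. The 1.5 × IQR rule for whiskers and outliers is the common convention, used here as an assumption about how the plot was drawn.

```python
import statistics

# Illustrative petal lengths in cm, grouped by species. These are made-up
# values in a plausible range, not rows copied from the Iris data set.
petal_length = {
    "setosa":     [1.3, 1.4, 1.4, 1.5, 1.5, 1.6, 1.7, 1.9],
    "versicolor": [3.3, 3.9, 4.0, 4.2, 4.5, 4.6, 4.7, 5.1],
    "virginica":  [4.5, 5.1, 5.5, 5.6, 5.8, 6.0, 6.1, 6.9],
}

def box_stats(data):
    """Five-number summary plus the points a box plot would flag as outliers."""
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [x for x in data if low_fence <= x <= high_fence]
    return {
        "whisker_low": min(inside),
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_high": max(inside),
        "outliers": [x for x in data if x < low_fence or x > high_fence],
    }

stats_by_species = {name: box_stats(vals) for name, vals in petal_length.items()}
for name, s in stats_by_species.items():
    print(name, s)
```

Placing the three summaries side by side is what makes the box plot useful: groups whose boxes do not overlap are easy to tell apart on that feature.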
Examples: Trend

Air Passengers Dataset
The classic Box & Jenkins airline data. Monthly totals of international airline passengers, 1949 to 1960.
● Histogram
● Scatter plot
● Line plot

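A line plot of a monthly series shows trend and seasonality together. One simple way to separate them is a 12-month moving average, sketched below. The series is synthetic (a straight-line trend plus a yearly sine cycle) and only imitates the shape of airline-passenger data; the numbers are invented for illustration.

```python
import math

# Synthetic monthly series: upward trend plus a 12-month seasonal cycle.
# It only imitates the shape of airline-passenger data; the numbers are invented.
months = list(range(48))
series = [100 + 2.5 * t + 20 * math.sin(2 * math.pi * t / 12) for t in months]

def moving_average(data, window):
    """Mean of each run of `window` consecutive values."""
    return [sum(data[i:i + window]) / window
            for i in range(len(data) - window + 1)]

# Averaging over a full year cancels the seasonal cycle and leaves the trend.
trend = moving_average(series, window=12)

# The average month-to-month change of the smoothed series estimates the slope.
steps = [b - a for a, b in zip(trend, trend[1:])]
slope = sum(steps) / len(steps)
print(f"estimated trend: {slope:.2f} units per month")
```

A histogram of the same series would hide this structure, because it discards the time order; that is why a line plot is the natural choice for trend data.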
Missing Data
● Any occurrence where data for a variable has not been recorded for some observation is considered missing from that observation.

Can’t we just drop the row with missing data?

Not a good idea. Why?

It is wasteful.
● May end up discarding a large portion of data
● A relatively small amount of missing data can have a big impact, because a single missing value discards the entire row

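A short simulation shows how quickly row dropping wastes data. It assumes a synthetic table in which every cell is independently missing with probability 5%; the table size and the rate are arbitrary choices for illustration. With 10 columns, the chance that a row is complete is 0.95^10, so about 40% of the rows are lost even though only 5% of the cells are missing.

```python
import random

random.seed(1)
n_rows, n_cols, p_missing = 10_000, 10, 0.05

# Synthetic table: every cell is independently missing with probability 5%.
table = [[None if random.random() < p_missing else 1.0 for _ in range(n_cols)]
         for _ in range(n_rows)]

missing_cells = sum(cell is None for row in table for cell in row)
complete_rows = [row for row in table if all(cell is not None for cell in row)]

cell_loss = missing_cells / (n_rows * n_cols)
row_loss = 1 - len(complete_rows) / n_rows
expected_row_loss = 1 - (1 - p_missing) ** n_cols  # about 0.40

print(f"cells missing:         {cell_loss:.1%}")
print(f"rows dropped:          {row_loss:.1%}")
print(f"expected rows dropped: {expected_row_loss:.1%}")
```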
It creates inconsistency.
● Difficult to compare models that may not use the same variables

It may create bias.
● Consider that each row represents a country and one of the features is GDP. Poor countries may not report GDP, so it shows up as missing data. Dropping those rows removes the poor countries, and the data becomes biased toward the rich countries!

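The GDP example can be reproduced with synthetic numbers. In the sketch below the "countries" are random draws, and the missingness rates (60% for countries below the median, 5% above it) are assumptions chosen only to make the effect visible.

```python
import random
import statistics

random.seed(2)

# Synthetic "countries": GDP per capita drawn from a skewed distribution.
gdp = [random.lognormvariate(9.0, 1.0) for _ in range(2000)]
cutoff = statistics.median(gdp)

# Assumption for this sketch: countries below the median GDP fail to report
# 60% of the time, richer countries only 5% of the time.
reported = []
for value in gdp:
    p_missing = 0.60 if value < cutoff else 0.05
    reported.append(None if random.random() < p_missing else value)

# "Just drop the rows": keep only countries with a reported GDP.
kept = [v for v in reported if v is not None]

true_mean = statistics.mean(gdp)
kept_mean = statistics.mean(kept)
print(f"true mean GDP:       {true_mean:,.0f}")
print(f"mean after dropping: {kept_mean:,.0f}")
print(f"share of rows kept:  {len(kept) / len(gdp):.0%}")
```

The mean computed from the remaining rows overstates the true mean, because the dropped rows are not a random sample of all countries.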
Missing Data
Missing data comes in three classes
● MCAR: Missing Completely At Random
● MAR: Missing At Random
● NMAR: Not Missing At Random

Missing Data: MCAR
● ‘Some of the data will be missing simply because of bad luck.’
● ‘This effectively implies that causes of the missing data are unrelated to the data.’
● Imagine tracking the number of cars at an intersection over time using a webcam. But the Wifi on your laptop fails occasionally, and you cannot record cars during the outage. The fact that they are missing has nothing to do with the cars. The missing car counts are MCAR.

Missing Data: MAR
● If the chance that a value is missing can be determined entirely by other variables in the dataset, then the data is missing at random.
● Say the webcam is known to shut down every night from 1am to 5am to save power. These missing car counts are MAR.

Missing Data: NMAR
● If data is NMAR, the chance that any value for the given variable is missing depends on data which is itself missing.
● People who do not live in permanent homes are much more likely to have missing data in a census because they are less likely to be found by pollsters.

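The three classes can be simulated on the car-count example. The sketch below uses synthetic hourly counts. The MCAR and MAR rules follow the webcam stories above, while the NMAR rule (the recorder loses values when traffic is heavy) is a hypothetical addition for illustration, as are all rates and thresholds.

```python
import random
import statistics

random.seed(3)

# Synthetic hourly car counts for 60 days: busy by day, quiet at night.
hours = [h % 24 for h in range(24 * 60)]
counts = [max(0, round(random.gauss(5 if 1 <= h < 5 else 40, 5))) for h in hours]

def observe(keep):
    """Apply a missingness rule; keep(hour, count) is False when the value is lost."""
    return [c if keep(h, c) else None for h, c in zip(hours, counts)]

# MCAR: the Wifi drops at random, regardless of hour or traffic.
mcar = observe(lambda h, c: random.random() > 0.15)
# MAR: the webcam is off from 1am to 5am; missingness depends only on the
# observed hour, not on the count itself.
mar = observe(lambda h, c: not (1 <= h < 5))
# NMAR (hypothetical rule for this sketch): the recorder overflows when
# traffic is heavy, so missingness depends on the unseen count.
nmar = observe(lambda h, c: c <= 45)

def observed_mean(values):
    """Mean of the values that were actually recorded."""
    return statistics.mean([v for v in values if v is not None])

true_mean = statistics.mean(counts)
means = {name: observed_mean(v)
         for name, v in [("MCAR", mcar), ("MAR", mar), ("NMAR", nmar)]}

print(f"true mean: {true_mean:.1f}")
for name, m in means.items():
    print(f"{name:>4} observed mean: {m:.1f}")
```

Under MCAR the observed mean stays close to the true mean. Under MAR the overall mean shifts, but the shift is explained by the hour, which is recorded, so it can be accounted for. Under NMAR the shift depends on values that were never seen, which is what makes this class the hardest to handle.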
Missing Data: Imputation
Imputation is the act of filling in missing data.
● Missing data can be filled with predefined values (e.g. 0).
● It can be filled with predictions of what the values should be.

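Both options can be sketched on a tiny example. The hours and car counts below are hypothetical. The "prediction" is shown in two simple forms, the mean of the observed values and a least-squares line fitted on another variable; these are common choices, not the only ones.

```python
import statistics

# Hypothetical hourly car counts; None marks a value that was not recorded.
hours = [6, 7, 8, 9, 10, 11, 12, 13]
counts = [12, 20, None, 36, 44, None, 60, 68]

observed = [(h, c) for h, c in zip(hours, counts) if c is not None]

# Option 1: fill with a predefined value such as 0.
filled_zero = [0 if c is None else c for c in counts]

# Option 2: fill with a simple prediction, here the mean of the observed values.
mean_count = statistics.mean([c for _, c in observed])
filled_mean = [mean_count if c is None else c for c in counts]

# Option 3: predict from another variable with a least-squares line count ~ hour.
xs = [h for h, _ in observed]
ys = [c for _, c in observed]
x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
intercept = y_bar - slope * x_bar
filled_line = [intercept + slope * h if c is None else c
               for h, c in zip(hours, counts)]

print("zero:", filled_zero)
print("mean:", filled_mean)
print("line:", filled_line)
```

Filling with 0 is easy but distorts the series, since a count of 0 at 8am is implausible here. The mean ignores the time of day, while the fitted line uses the hour to produce values that follow the pattern of the observed data.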
Missing Data: Imputation
● Typically, imputation is considered when less than 20% of the data is missing. The quality of the imputation depends on both the proportion of data that is missing and the pattern, if any, to the missingness.
● Imputation is only as reliable and valid as the data it draws from. It isn't a magic method that makes real information out of nothing.
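Before imputing, it helps to measure how much is missing in each column. The sketch below applies the 20% rule of thumb from the text to a hypothetical table; the column names and values are invented, and the threshold is a guideline, not a hard limit.

```python
# Hypothetical table with None marking missing values.
table = {
    "age":    [34, None, 51, 29, 42, 38, 40, 45, 50, 27],
    "income": [None, None, 52_000, None, 61_000, None, 48_000, None, 39_000, None],
    "city":   ["A", "B", "B", "A", "C", "A", "B", "C", "A", "B"],
}

THRESHOLD = 0.20  # rule of thumb from the text, not a hard limit

def missing_fraction(values):
    """Share of entries in a column that are missing."""
    return sum(v is None for v in values) / len(values)

report = {name: missing_fraction(col) for name, col in table.items()}
candidates = [name for name, frac in report.items() if 0 < frac < THRESHOLD]
too_sparse = [name for name, frac in report.items() if frac >= THRESHOLD]

for name, frac in report.items():
    print(f"{name:>6}: {frac:.0%} missing")
print("imputation candidates:", candidates)
print("too much missing:", too_sparse)
```

The proportion is only half of the picture: the pattern of missingness (MCAR, MAR or NMAR) still has to be judged from knowledge of how the data was collected.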
