Module 4

The document outlines the process of exploratory data analysis, emphasizing the importance of understanding data structures, completeness, and relationships. It discusses various datasets, such as the Boston House Prices and Iris datasets, and highlights the significance of handling missing data through imputation methods. The document also categorizes missing data into three classes: MCAR, MAR, and NMAR, and warns against simply discarding rows with missing values due to potential biases and inconsistencies.

Uploaded by

Pratham Choubey

Learning Objectives

● Exploratory Data Analysis
● Hands-on Code
● Understanding missing data

Introduction

Where are we?

Stages of the process: Problem Formulation, Data Collection & Processing, Exploratory Data Analysis, Data Modeling, Insight/Prediction, Presentation

● Raw data collection & pre-processing
● Data collection and preprocessing take up ~80% of the time

● Raw data preprocessing tools
● Data query language (SQL) for search and update in a relational DBMS
● Storing semi-structured data in XML, JSON formats

Introduction: Real world scenario

Collected raw data
● Structured: table, RDBMS, SQL
● Semi-structured: XML, JSON, NRDBMS, NoSQL
● Unstructured

Preprocessing
● Small public datasets: CSV, R/Python

Exploratory Data Analysis

Data Exploration
● The goal of data exploration is to learn about the data.

● The data scientist wants to know the basic characteristics of the data, e.g.,
○ the structure,
○ the size,
○ the completeness (or rather where data is missing), and
○ the relationships between different parts of the data.
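
The four characteristics listed above can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the records, column names (`rooms`, `price`, `town`) and values are hypothetical stand-ins for rows read from a CSV file.

```python
import math

# Hypothetical records, standing in for rows read from a CSV file.
rows = [
    {"rooms": 6.5, "price": 24.0, "town": "A"},
    {"rooms": 6.4, "price": 21.6, "town": "B"},
    {"rooms": 7.2, "price": None, "town": "A"},
    {"rooms": 7.0, "price": 33.4, "town": None},
    {"rooms": 5.9, "price": 18.9, "town": "C"},
]
columns = list(rows[0])

# Size: number of rows and columns.
size = (len(rows), len(columns))

# Structure: the Python type(s) seen in each column, ignoring missing values.
structure = {
    c: sorted({type(r[c]).__name__ for r in rows if r[c] is not None})
    for c in columns
}

# Completeness: how many values are missing in each column.
missing = {c: sum(r[c] is None for r in rows) for c in columns}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Relationships: correlation between two numeric columns,
# using only the rows where both values are present.
pairs = [(r["rooms"], r["price"]) for r in rows
         if r["rooms"] is not None and r["price"] is not None]
corr = pearson([p[0] for p in pairs], [p[1] for p in pairs])

print("size:", size)
print("structure:", structure)
print("missing:", missing)
print("corr(rooms, price):", round(corr, 3))
```

In practice a data-frame library would usually provide these summaries directly; the plain-Python version is only meant to show what each one measures.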
● The exploration is usually a semi-automated interactive process in which data scientists use many different tools to consider different aspects of the data.

● These tools allow the data scientist to inspect raw data or preprocessed data, e.g., comma-separated values (CSV) files.

● In this course we will use tools available in Python:
○ Statistical measures
○ Visualizations
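
As a small illustration of statistical measures, the sketch below summarizes one numeric variable with Python's built-in `statistics` module. The values are placeholders chosen for the example, not real data.

```python
import statistics

# Hypothetical sample of a numeric variable (placeholder values, not real data).
values = [21.0, 22.5, 19.8, 35.2, 24.1, 18.3, 50.0, 23.7, 20.4, 27.9]

summary = {
    "count": len(values),
    "min": min(values),
    "max": max(values),
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "stdev": statistics.stdev(values),  # sample standard deviation
}

for name, value in summary.items():
    print(f"{name:>6}: {value:.2f}")
```

A mean noticeably above the median, as in this sample, is a hint that a few large values pull the distribution to the right.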
Examples

Boston House Prices Dataset

This dataset contains information collected by the U.S. Census Service concerning housing in the area of Boston, Mass. (1978).

We will focus on exploring two of its variables: MEDV (median value of owner-occupied homes, in $1000s) and CRIM (per capita crime rate by town).

Examples: MEDV

Histogram
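
A histogram groups the values into bins and counts how many fall in each. The sketch below does the binning by hand and prints a text histogram; the values are placeholders for illustration and are not the real MEDV column, and the bin width of 10 is an arbitrary choice.

```python
from collections import Counter

# Placeholder values for illustration only; these are not the real MEDV data.
values = [13.1, 15.6, 18.9, 19.4, 20.3, 21.7, 22.0, 22.8, 23.9, 24.5,
          25.0, 27.1, 29.8, 33.4, 36.2, 43.8, 50.0]


def histogram(data, bin_width):
    """Count how many values fall in each bin [k*w, (k+1)*w)."""
    counts = Counter(int(x // bin_width) for x in data)
    return {k * bin_width: counts[k] for k in sorted(counts)}


bins = histogram(values, bin_width=10)
for left, count in bins.items():
    print(f"{left:>3}-{left + 10:<3} {'#' * count} ({count})")
```

Only non-empty bins are listed. Changing `bin_width` changes how much detail the histogram shows, so it is worth trying a few widths.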

Examples: Boston House Prices / MEDV

Density

Density + rug

Density + histogram + rug
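
A density curve is a smoothed version of the histogram, and a rug marks each raw observation on the x-axis. The sketch below computes a Gaussian kernel density estimate by hand, assuming a hand-picked bandwidth of 3.0 and placeholder values that are not the real MEDV data. A plotting library would normally draw the result; here the curve is printed as text.

```python
import math

# Placeholder values for illustration only; these are not the real MEDV data.
values = [13.1, 15.6, 18.9, 19.4, 20.3, 21.7, 22.0, 22.8, 23.9, 24.5,
          25.0, 27.1, 29.8, 33.4, 36.2, 43.8, 50.0]


def gaussian_kde(data, bandwidth):
    """Return a function estimating the density as an average of Gaussian bumps."""
    norm = 1.0 / (len(data) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data)

    return density


density = gaussian_kde(values, bandwidth=3.0)  # bandwidth chosen by hand

# Evaluate the curve on a grid; a plotting library would draw these points.
grid = [float(x) for x in range(0, 65, 5)]
curve = [density(x) for x in grid]

# The rug is simply the sorted raw observations, drawn as ticks on the x-axis.
rug = sorted(values)

for x, y in zip(grid, curve):
    print(f"{x:5.1f} {'*' * round(y * 400)}")
```

A smaller bandwidth follows the data more closely and gives a bumpier curve; a larger one gives a smoother curve that may hide detail.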

Examples: Boston House Prices / CRIM

Density + histogram + rug

Examples

Iris Dataset

The Iris flower data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper.

The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica, and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters.

Examples: IRIS

Box plot
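
A box plot draws the quartiles of a variable as a box, with whiskers and individually marked outliers. The sketch below computes those numbers with the standard library, assuming the common 1.5 × IQR convention for the whiskers. The measurements are placeholders, not values from the real Iris data.

```python
import statistics

# Placeholder measurements in centimeters; not taken from the real Iris data.
values = [4.4, 4.9, 5.0, 5.1, 5.4, 5.5, 5.8, 6.0, 6.3, 6.4, 6.7, 9.5]

# Quartiles split the sorted data into four equal parts.
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Common convention: whiskers reach the most extreme points within 1.5 * IQR
# of the box; anything beyond is drawn individually as an outlier.
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
inside = [v for v in values if low_fence <= v <= high_fence]
outliers = [v for v in values if v < low_fence or v > high_fence]

print("box:", q1, median, q3)
print("whiskers:", min(inside), max(inside))
print("outliers:", outliers)
```

Computing these numbers once per species is what lets a grouped box plot compare the species side by side.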

Examples: Trend

Air Passengers Dataset

The classic Box & Jenkins airline data: monthly totals of international airline passengers, 1949 to 1960.

Histogram

Scatter plot

Line plot
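
One common way to make a trend visible in a monthly series is a 12-month moving average, which smooths out the yearly seasonal pattern. The sketch below uses a synthetic series built for the example (a linear trend plus a sine-shaped seasonal swing); it is not the real airline data.

```python
import math

# Synthetic monthly series (hypothetical, not the real airline data):
# a steady upward trend plus a repeating 12-month seasonal swing.
months = range(48)
series = [100 + 2 * m + 15 * math.sin(2 * math.pi * m / 12) for m in months]


def moving_average(data, window):
    """Mean of each run of `window` consecutive values."""
    return [sum(data[i:i + window]) / window
            for i in range(len(data) - window + 1)]


# Averaging over a full year cancels the seasonal swing and leaves the trend.
trend = moving_average(series, window=12)

print("first and last raw values:", round(series[0], 1), round(series[-1], 1))
print("first and last trend values:", round(trend[0], 1), round(trend[-1], 1))
```

Plotting `series` and `trend` on the same line plot would show the raw curve oscillating around the smoothed one.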

Missing Data

Figure: Example of missing data

● Any occurrence where data for a variable has not been recorded for some observation is considered missing from that observation.

Can’t we just drop the row with missing data?

Not a good idea. Why?

It is wasteful.

● You may end up discarding a large portion of the data
● A relatively small amount of missing data can have a big impact, because a whole row is dropped as soon as one of its values is missing
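
The arithmetic behind this can be sketched as follows, assuming a hypothetical table with 10 columns in which each cell is missing independently with probability 5%. A row survives only if all 10 cells are present, so the expected share of dropped rows is 1 - 0.95^10, roughly 40%.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

N_ROWS, N_COLS, P_MISSING = 10_000, 10, 0.05  # hypothetical table and miss rate

# Each cell is independently missing (None) with probability P_MISSING.
table = [[None if random.random() < P_MISSING else 1.0 for _ in range(N_COLS)]
         for _ in range(N_ROWS)]

cells_missing = sum(v is None for row in table for v in row) / (N_ROWS * N_COLS)
rows_dropped = sum(any(v is None for v in row) for row in table) / N_ROWS

# With independent cells, a row survives only if all of its columns are present.
expected_dropped = 1 - (1 - P_MISSING) ** N_COLS

print(f"cells missing:    {cells_missing:.1%}")
print(f"rows dropped:     {rows_dropped:.1%}")
print(f"expected dropped: {expected_dropped:.1%}")
```

The independence assumption is a simplification; real tables often have missing values clustered in the same rows, which changes the numbers but not the basic point.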

Creates inconsistency.

● It is difficult to compare models that do not use the same variables, because each may end up fitted on a different set of rows

It may create bias.

● Suppose each row represents a country and one of the features is GDP. Poor countries may not report GDP, so it shows up as missing data. Dropping those rows removes the poor countries, and the data becomes biased toward the rich countries!
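
The GDP example can be made concrete with a small sketch. The country labels and GDP figures below are invented for illustration, and the assumption that exactly the three poorest countries fail to report is deliberately extreme.

```python
import statistics

# Hypothetical GDP-per-capita figures (arbitrary units); None means the
# country did not report. Here only poorer countries fail to report.
gdp_true = {"A": 1.2, "B": 2.5, "C": 3.1, "D": 8.0, "E": 22.0, "F": 41.0, "G": 55.0}
unreported = {"A", "B", "C"}
gdp_observed = {c: (None if c in unreported else v) for c, v in gdp_true.items()}

# Dropping rows with missing GDP keeps only the countries that reported.
kept = [v for v in gdp_observed.values() if v is not None]

true_mean = statistics.mean(gdp_true.values())
mean_after_drop = statistics.mean(kept)

print(f"mean over all countries:   {true_mean:.1f}")
print(f"mean after dropping rows:  {mean_after_drop:.1f}")
```

The mean after dropping is far higher than the mean over all countries, and any model trained on the remaining rows would only ever see the richer ones.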

Missing Data

Missing data comes in three classes:
● MCAR: Missing Completely At Random
● MAR: Missing At Random
● NMAR: Not Missing At Random

MCAR: Missing Completely At Random

'Some of the data will be missing simply because of bad luck.'

'This effectively implies that causes of the missing data are unrelated to the data.'

● Imagine tracking the number of cars at an intersection over time using a webcam. But the Wi-Fi on your laptop fails occasionally, and you cannot record cars during the outage. The fact that they are missing has nothing to do with the cars. The missing car counts are MCAR.

MAR: Missing At Random

● If the chance that a value is missing can be determined entirely by other variables in the dataset, then the data is missing at random.
● Say the webcam is known to shut down every night from 1am to 5am to save power. These missing car counts are MAR.

NMAR: Not Missing At Random

● If data is NMAR, the chance that any value for the given variable is missing depends on data which is itself missing.
● People who do not live in permanent homes are much more likely to have missing data in a census because they are less likely to be found by pollsters.
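
The three classes can be compared side by side with a small simulation of the car-count example. This is a sketch under stated assumptions: the hourly counts are synthetic, the 10% Wi-Fi failure rate is arbitrary, and the NMAR scenario (recording fails when traffic is heaviest) is an assumed variation that does not come from the slides.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical hourly car counts for 30 days: busier during the day.
hours = [(day, hour) for day in range(30) for hour in range(24)]
counts = {(d, h): random.randint(0, 20) + (60 if 7 <= h <= 19 else 0)
          for d, h in hours}

# MCAR: the Wi-Fi drops at random, independent of hour and of traffic.
mcar = {k: (None if random.random() < 0.1 else v) for k, v in counts.items()}

# MAR: the webcam is off from 1am to 5am; missingness depends only on the
# hour, which is itself recorded in the dataset.
mar = {(d, h): (None if 1 <= h < 5 else v) for (d, h), v in counts.items()}

# NMAR (assumed scenario, not from the slides): recording fails when traffic
# is heaviest, so missingness depends on the very value that goes missing.
nmar = {k: (None if v >= 75 else v) for k, v in counts.items()}


def observed_mean(data):
    """Mean of the values that were actually recorded."""
    seen = [v for v in data.values() if v is not None]
    return sum(seen) / len(seen)


true_mean = sum(counts.values()) / len(counts)
for name, data in [("MCAR", mcar), ("MAR", mar), ("NMAR", nmar)]:
    print(f"{name}: observed mean {observed_mean(data):.1f} vs true {true_mean:.1f}")
```

Under MCAR the observed mean stays close to the true one. Under MAR it shifts, but the recorded hour explains which values are missing, so the gap can be modeled. Under NMAR the cause of the gap is not in the data at all.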

Missing Data

Imputation is the act of filling in missing data.
● Missing data can be filled with predefined values (e.g. 0).
● It can be filled with predictions of what the values should be.
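
Both options can be sketched in a few lines. The car counts below are hypothetical, and the mean of the observed values is used as a very simple stand-in for a prediction; it is one basic choice among many.

```python
import statistics

# Hypothetical hourly car counts; None marks a missing reading.
counts = [12, 15, None, 14, 13, None, 16, 18]
observed = [c for c in counts if c is not None]

# Option 1: fill with a predefined value such as 0.
filled_zero = [0 if c is None else c for c in counts]

# Option 2: fill with a simple prediction; here, the mean of the observed
# values (one basic choice among many possible predictors).
mean_value = statistics.mean(observed)
filled_mean = [mean_value if c is None else c for c in counts]

print("zero-filled:", filled_zero, "mean:", statistics.mean(filled_zero))
print("mean-filled:", filled_mean, "mean:", statistics.mean(filled_mean))
```

Filling with 0 pulls the average down, because 0 is not a plausible count here. Filling with the mean keeps the average unchanged but makes the data look less variable than it really is.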

● Typically, imputation is considered when less than 20% of the data is missing. The quality of the imputation depends on both the proportion of data that is missing and the pattern, if any, to the missingness.
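
A quick check of the missing proportion per column might look like the sketch below. The table, column names and values are hypothetical, and the 20% threshold is the rule of thumb quoted above rather than a hard limit.

```python
# Hypothetical table: column name -> values, with None marking a missing entry.
table = {
    "age":    [34, 51, None, 29, 42, 38, 47, 55, 31, 40],
    "income": [None, 52.0, None, 61.5, None, 48.0, 70.2, None, 39.9, 57.3],
}

THRESHOLD = 0.20  # the rule of thumb quoted above, not a hard limit


def missing_fraction(values):
    """Share of entries in a column that are missing."""
    return sum(v is None for v in values) / len(values)


report = {name: missing_fraction(col) for name, col in table.items()}
for name, frac in report.items():
    verdict = "consider imputation" if frac < THRESHOLD else "too sparse, look closer"
    print(f"{name}: {frac:.0%} missing -> {verdict}")
```

The proportion alone does not settle the question; the pattern of missingness matters as well.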

● Imputation is only as reliable and valid as the data it draws from. It isn't a magic method that makes real information out of nothing.