Module 4
Introduction
[Diagram: Problem Formulation, Exploratory Data Analysis, Data Modeling, Presentation]
[Diagram: Raw Data, Collected, Preprocessing, Exploratory Data Analysis. Data types shown: Table, Semi-structured (XML, JSON), Unstructured. Storage labels: NRDBMS, NoSQL]
Examples
Examples: MEDV
Histogram
Examples: Boston House Prices / MEDV
Density
Examples: Boston House Prices / MEDV
Density + rug
Examples: Boston House Prices / CRIM
Density + histogram + rug
Examples
Iris Dataset
Examples: IRIS
Box plot
Examples: Trend
Air Passengers Dataset
The classic Box & Jenkins airline
data. Monthly totals of
international airline
passengers, 1949 to 1960.
Examples: Trend
Histogram
Examples: Trend
Scatter plot
Examples: Trend
Line plot
Missing Data
[Figure: example of missing data]
Missing Data
● Any occurrence where data for a
variable has not been recorded for
some observation is considered
missing from that observation.
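As a minimal sketch of this definition, the snippet below counts, for each variable, how many observations have no recorded value. The records, the field names, the helper `count_missing`, and the use of `None` as the missing-value marker are illustrative assumptions, not part of the slides.

```python
# Hypothetical observations; None marks a value that was not recorded.
observations = [
    {"hour": 0, "cars": 12},
    {"hour": 1, "cars": None},
    {"hour": 2, "cars": 7},
    {"hour": 3, "cars": None},
]


def count_missing(rows):
    """Return {variable: number of observations where it is missing}."""
    counts = {}
    for row in rows:
        for name, value in row.items():
            # A bool adds as 0 or 1, so this increments only on None.
            counts[name] = counts.get(name, 0) + (value is None)
    return counts


missing_counts = count_missing(observations)
print(missing_counts)  # {'hour': 0, 'cars': 2}
```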
Missing Data
Not a good idea. Why?
● It is wasteful.
● Creates inconsistency.
Missing Data
Missing data comes in three classes:
● MCAR: Missing Completely At Random
● MAR: Missing At Random
● NMAR: Not Missing At Random
'Some of the data will be missing simply because of bad luck.'
'This effectively implies that causes of the missing data are
unrelated to the data.'
● Imagine tracking the number of cars at an intersection over
time using a webcam. But the Wi-Fi on your laptop fails
occasionally, and you cannot record cars during the outage.
The fact that they are missing has nothing to do with the cars.
The missing car counts are MCAR.
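The webcam example can be simulated roughly as follows. This is only a sketch: the car counts are randomly generated, and the 20% outage probability (`OUTAGE_PROBABILITY`) is an assumed value chosen for illustration. The point it shows is that under MCAR every value has the same chance of being lost, whatever the hour or the count.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical hourly car counts for two days (made-up values).
true_counts = [rng.randint(0, 40) for _ in range(48)]

OUTAGE_PROBABILITY = 0.2  # assumed chance that the Wi-Fi is down in any hour

# MCAR: every value is dropped with the same probability,
# regardless of the hour or of the count itself.
observed = [
    None if rng.random() < OUTAGE_PROBABILITY else count
    for count in true_counts
]

n_missing = sum(value is None for value in observed)
print(f"{n_missing} of {len(observed)} counts are missing")
```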
Missing Data
MAR: Missing At Random
● If the chance that a value is missing
can be determined entirely by other
variables in the dataset, then the
data is missing at random.
● Say the webcam is known to shut
down every night from 1am to 5am
to save power.
These missing car counts are MAR.
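A small sketch of the night-time shutdown, under the assumption that "1am to 5am" means the hours starting at 1, 2, 3 and 4. The counts are made up and `camera_is_off` is a hypothetical helper. Whether a count is missing is decided entirely by the hour, a variable that is always recorded.

```python
import random

rng = random.Random(0)

# Hypothetical (hour, count) pairs for one day; the counts are made up.
records = [(hour, rng.randint(0, 40)) for hour in range(24)]


def camera_is_off(hour):
    """Assumed schedule: off from 1am up to, but not including, 5am."""
    return 1 <= hour < 5


# MAR: whether a count is missing is fully determined by the hour,
# which is itself always recorded.
observed = [
    (hour, None if camera_is_off(hour) else count)
    for hour, count in records
]

missing_hours = [hour for hour, count in observed if count is None]
print(missing_hours)  # [1, 2, 3, 4]
```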
Missing Data
NMAR: Not Missing At Random
● If data is NMAR, the chance that any
value for the given variable is missing
depends on data which is itself
missing.
● People who do not live in permanent
homes are much more likely to have
missing data in a census because
they are less likely to be found by
census takers.
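The census example can be sketched as a simulation. The population size, the 90% share of people with a permanent home, and the two reach probabilities are all made-up assumptions. The sketch illustrates why NMAR is troublesome: the records that do get made under-represent exactly the group whose data tends to go missing.

```python
import random

rng = random.Random(0)

# Hypothetical population: True means the person has a permanent home.
population = [rng.random() < 0.9 for _ in range(10_000)]

# Assumed chances of being reached by the census; made-up numbers.
P_REACHED_WITH_HOME = 0.95
P_REACHED_WITHOUT_HOME = 0.40

# NMAR: the chance that the housing answer is missing depends on
# the answer itself, which is exactly the value we fail to record.
observed = []
for has_home in population:
    p_reached = P_REACHED_WITH_HOME if has_home else P_REACHED_WITHOUT_HOME
    observed.append(has_home if rng.random() < p_reached else None)

recorded = [value for value in observed if value is not None]
true_share = population.count(False) / len(population)
observed_share = recorded.count(False) / len(recorded)
print(f"true share without a permanent home: {true_share:.3f}")
print(f"share among the records made:        {observed_share:.3f}")
```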
Missing Data
Imputation is the act of filling in missing
data.
● Missing data can be filled with predefined
values (e.g. 0).
● It can also be filled with predictions of what
the values should be.
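Both options can be sketched in a few lines. The data and the function names `impute_constant` and `impute_mean` are hypothetical. The mean of the observed values stands in here for a very simple prediction; a real prediction would normally come from a model fitted on the other variables.

```python
from statistics import mean

# Hypothetical car counts; None marks a missing value.
counts = [12, None, 7, 9, None, 12]


def impute_constant(values, fill=0):
    """Replace every missing value with a predefined constant."""
    return [fill if v is None else v for v in values]


def impute_mean(values):
    """Replace every missing value with the mean of the observed values."""
    observed_mean = mean(v for v in values if v is not None)
    return [observed_mean if v is None else v for v in values]


filled_with_zero = impute_constant(counts)
filled_with_mean = impute_mean(counts)
print(filled_with_zero)  # [12, 0, 7, 9, 0, 12]
print(filled_with_mean)
```

In this sketch the observed values are 12, 7, 9 and 12, so the mean-based fill puts 10 in both gaps, whereas filling with 0 pulls the overall average down.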
Missing Data
● Typically, imputation is considered when less
than 20% of the data is missing. The quality of
the imputation depends on both the
proportion of data that is missing, and the
pattern, if any, to the missingness.
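One possible way to apply the 20% rule of thumb is to compute the missing fraction for each variable before deciding whether to impute. The dataset below is made up, and applying the threshold per variable rather than to the dataset as a whole is an assumption of this sketch.

```python
# Hypothetical dataset: one list of values per variable, None = missing.
data = {
    "cars": [12, None, 7, 9, 11, 8, 10, 13, 9, 12],
    "trucks": [1, None, None, 2, None, 0, 1, None, 2, 1],
}

MAX_MISSING_FRACTION = 0.20  # the rule of thumb quoted above


def missing_fraction(values):
    """Fraction of entries in `values` that are missing."""
    return sum(v is None for v in values) / len(values)


for name, values in data.items():
    fraction = missing_fraction(values)
    if fraction < MAX_MISSING_FRACTION:
        verdict = "below the 20% rule of thumb, imputation is an option"
    else:
        verdict = "at or above the 20% rule of thumb"
    print(f"{name}: {fraction:.0%} missing, {verdict}")
```

The fraction alone does not settle the question: as the slide notes, the pattern of missingness (MCAR, MAR or NMAR) also affects how trustworthy the imputed values are.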