Data Exploration

Dr. Saed Sayad


University of Toronto
2010

http://chem-eng.utoronto.ca/~datamining/
Data Mining Steps
1. Problem Definition
2. Data Preparation
3. Data Exploration
4. Modeling
5. Evaluation
6. Deployment

1. Problem Definition
Understanding the project objectives and requirements from a business perspective, converting this knowledge into a data mining problem definition, and a preliminary plan designed to achieve the objectives.

Source: http://www.crisp-dm.org/Process/index.htm

2. Data Preparation

[Diagram: data sources (DSN, text) are loaded through ETL into the modeling data]

3. Data Exploration
• Univariate Analysis
  – Frequency, Mean, Min, Max, ...
  – Bar, Line, Pie, ... Charts
• Bivariate Analysis
  – Correlation, Z test, ...
  – Combination Charts

Data Exploration - Univariate Analysis
• Categorical
  – Count, Frequency
  – Bar and Pie Charts
• Numerical
  – Count, Mean, StDev
  – Histogram, Box Plot

Univariate Analysis - Categorical
housing     Count   Frequency
for free       96      10.67%
own           641      71.22%
rent          163      18.11%

[Pie chart: Housing (for free 11%, own 71%, rent 18%)]
[Bar chart: Housing counts for "for free", "own", and "rent"; vertical axis 0 to 700]
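
The count and frequency columns for a categorical variable can be computed along these lines. This is a minimal sketch: the `housing` list is rebuilt from the counts on the slide, and the function name `frequency_table` is illustrative rather than part of the original material.

```python
from collections import Counter

def frequency_table(values):
    """Return {category: (count, percent)} for a categorical variable."""
    counts = Counter(values)
    total = sum(counts.values())
    return {cat: (n, round(100 * n / total, 2)) for cat, n in counts.items()}

# Rebuild the housing variable from the counts shown on the slide.
housing = ["for free"] * 96 + ["own"] * 641 + ["rent"] * 163
table = frequency_table(housing)

for category, (count, percent) in sorted(table.items()):
    print(f"{category:<10}{count:>6}{percent:>8.2f}%")
```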


Missing Values
[Bar chart: frequency of each Education value; vertical axis Frequency, 0 to 2,500,000; annotated "83% Missing Value"]

Invalid Values

[Bar chart: Invalid doc_type_id; frequency of each doc_type_id value; vertical axis Frequency, 0 to 1,400,000]

Univariate Analysis - Numeric

Age

Count      900    Average   35.25    StDev      11.20
Min         19    Median    33       Variance   125.37
Maximum     75    Mode      27       CV         32%
Range       56                       Skewness   1.09
Missing      0                       Kurtosis   0.88
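
Summary statistics of this kind can be sketched with Python's standard `statistics` module. The `ages` sample below is hypothetical (it is not the 900-record data set from the slide), and the skewness and kurtosis use the plain moment-based definitions, which is an assumption: other tools may apply small-sample corrections and report slightly different values.

```python
import statistics

def describe(values):
    """Summary statistics for a numeric variable (no missing values)."""
    n = len(values)
    mean = statistics.fmean(values)
    stdev = statistics.stdev(values)  # sample standard deviation
    # Central moments used for moment-based skewness and excess kurtosis.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "count": n,
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "mean": mean,
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "stdev": stdev,
        "variance": statistics.variance(values),  # sample variance
        "cv": stdev / mean,                       # coefficient of variation
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2 - 3,             # excess kurtosis
    }

ages = [19, 23, 27, 27, 31, 33, 38, 45, 52, 75]  # hypothetical sample
summary = describe(ages)
for name, value in summary.items():
    print(f"{name:<10}{value}")
```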

Missing and Invalid Values and Outliers
[Chart: Months in Business]

Box Plot

[Figure: box plot with outliers marked]

Univariate Analysis - Policies
• Categorical variable
  – Missing Values
  – Invalid Values
  – Encoding
• Numeric variable
  – Missing Values
  – Invalid Values & Outliers
  – Binning

Missing Value Policies
• Fill in missing values manually, based on domain knowledge
• Ignore the records with missing data
• Fill them in automatically with:
  – A global constant (e.g., “?”)
  – The variable mean
  – Inference-based methods such as Bayes’ rule, a decision tree, or the EM algorithm
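
The simpler policies above can be sketched as follows. The sketch assumes missing values are represented as `None`, which is a choice made here for illustration; the function names are likewise hypothetical. Inference-based methods are not shown.

```python
from statistics import fmean

def drop_missing(records):
    """Ignore (drop) records that contain any missing value."""
    return [r for r in records if all(v is not None for v in r.values())]

def fill_with_constant(values, constant="?"):
    """Replace missing values with a global constant."""
    return [constant if v is None else v for v in values]

def fill_with_mean(values):
    """Replace missing values with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = fmean(observed)
    return [mean if v is None else v for v in values]

records = [
    {"age": 30, "housing": "own"},
    {"age": None, "housing": "rent"},
]
print(drop_missing(records))
print(fill_with_constant(["own", None, "rent"]))
print(fill_with_mean([1.0, None, 3.0]))
```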

Managing Outliers
• Outliers are data points inconsistent with the majority of the data
• Types of outliers
  – Valid: a CEO’s salary
  – Noisy: a recorded age of 200, or other widely deviating points
• Removal methods
  – Box plot
  – Clustering
  – Curve fitting
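
As an illustration of the box plot method, the sketch below flags points that fall more than 1.5 interquartile ranges outside the quartiles. The 1.5 multiplier is the usual box plot convention and is an assumption here, since the slides do not state a threshold; the `ages` data are hypothetical.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

ages = [22, 25, 27, 29, 31, 33, 35, 38, 41, 200]  # hypothetical, one noisy age
print(iqr_outliers(ages))
```

Whether a flagged point is removed should still depend on whether it is valid (a CEO's salary) or noisy (an age of 200).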

Encoding Categorical Variables
• Encoding is the process of transforming categorical variables into numerical counterparts.

• Encoding methods:
  – Binary method
  – Ordinal method
  – Target-based encoding

Encoding

Housing (for free, own, rent)

• Binary method:
  – for free: 1, 0, 0
  – own: 0, 1, 0
  – rent: 0, 0, 1
• Ordinal method:
  – own: 1
  – for free: 3
  – rent: 5
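
The two encodings of the Housing variable can be reproduced with a short sketch. The category order and the ordinal codes (1, 3, 5) are taken from the slide; the function names are illustrative.

```python
def binary_encode(value, categories):
    """One indicator per category: 1 for the matching category, else 0."""
    return [1 if value == c else 0 for c in categories]

def ordinal_encode(value, codes):
    """Map a category to its assigned numeric code."""
    return codes[value]

categories = ["for free", "own", "rent"]
ordinal_codes = {"own": 1, "for free": 3, "rent": 5}  # codes from the slide

for housing in categories:
    print(housing, binary_encode(housing, categories),
          ordinal_encode(housing, ordinal_codes))
```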

Binning Numerical Variables
• Binning is the process of transforming numerical variables into categorical counterparts.

• Binning methods:
  – Equal width
  – Equal frequency
  – Entropy based

Binning
• Variable: 0, 4, 12, 16, 16, 18, 24, 26, 28
• Equi-width binning:
  – Bin 1 (-∞, 10): 0, 4
  – Bin 2 [10, 20): 12, 16, 16, 18
  – Bin 3 [20, +∞): 24, 26, 28
• Equi-frequency binning:
  – Bin 1 (-∞, 14): 0, 4, 12
  – Bin 2 [14, 21): 16, 16, 18
  – Bin 3 [21, +∞): 24, 26, 28
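
Both binning schemes can be sketched in a few lines. One assumption to note: the equal-width function below derives its bin width from the data range, (max - min) / k, rather than using the rounded edges 10 and 20 shown above; for this variable the resulting groups are the same. The function names are illustrative, and the equal-width version assumes the values are not all identical.

```python
def equal_width_bins(values, k):
    """Split values into k bins of equal width over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in sorted(values):
        idx = min(int((v - lo) / width), k - 1)  # the maximum goes in the last bin
        bins[idx].append(v)
    return bins

def equal_frequency_bins(values, k):
    """Split sorted values into k bins holding (nearly) equal counts."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]

variable = [0, 4, 12, 16, 16, 18, 24, 26, 28]
print(equal_width_bins(variable, 3))
print(equal_frequency_bins(variable, 3))
```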
Binning
[Chart: binned Months in Business]

Data Exploration – Bivariate Analysis
• Numeric & Numeric
  – Correlation
  – Scatter Plot
• Categorical & Numeric
  – z-test, t-test, ANOVA
  – Combination Chart
• Categorical & Categorical
  – Chi2 test
  – Combination Chart

Numeric & Numeric

[Scatter plot: Total Balance ($0 to $120,000) versus Months in Business (0 to 2000); Correlation = 0.114]
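
A correlation like the one reported for this plot is typically the Pearson coefficient, which is an assumption here since the slide does not name the measure. The sketch below computes it from scratch; the `months` and `balance` values are hypothetical and are not the data behind the plot.

```python
import math

def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two numeric variables."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)

months = [12, 48, 120, 240, 600]               # hypothetical
balance = [18000, 25000, 21000, 30000, 26000]  # hypothetical
print(round(pearson_correlation(months, balance), 3))
```

A value near 0, as on the slide, indicates little linear relationship between the two variables.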

Categorical & Numeric

Default   Total Balance Average   Total Balance Variance
N         $22,994                 $3,250
Y         $26,874                 $3,872

Is there a significant difference in the average balance between the two groups?

Is there a significant difference in the balance variance between the two groups?

Categorical & Numeric

• Z test
• t test
• F test
• ANOVA

Categorical & Numeric - Z, t, F Tests

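$$Z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{S_1^2}{N_1} + \dfrac{S_2^2}{N_2}}}$$

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{S^2 \left( \dfrac{1}{N_1} + \dfrac{1}{N_2} \right)}}$$

$$F = \frac{S_1^2}{S_2^2}$$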
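
The three statistics can be computed from group summaries as in the sketch below. Two points are assumptions rather than content from the slides: S² in the t statistic is taken to be the pooled variance, ((N1 - 1)S1² + (N2 - 1)S2²) / (N1 + N2 - 2), and the numbers in the example are hypothetical.

```python
import math

def z_statistic(mean1, mean2, var1, var2, n1, n2):
    """Z = (X1 - X2) / sqrt(S1^2/N1 + S2^2/N2)."""
    return (mean1 - mean2) / math.sqrt(var1 / n1 + var2 / n2)

def pooled_variance(var1, var2, n1, n2):
    """Assumed definition of S^2: variance pooled across both groups."""
    return ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)

def t_statistic(mean1, mean2, var1, var2, n1, n2):
    """t = (X1 - X2) / sqrt(S^2 * (1/N1 + 1/N2))."""
    s2 = pooled_variance(var1, var2, n1, n2)
    return (mean1 - mean2) / math.sqrt(s2 * (1 / n1 + 1 / n2))

def f_statistic(var1, var2):
    """F = S1^2 / S2^2."""
    return var1 / var2

# Hypothetical group summaries: means, variances, and sizes.
z = z_statistic(10.0, 8.0, 4.0, 4.0, 4, 4)
t = t_statistic(10.0, 8.0, 4.0, 4.0, 4, 4)
f = f_statistic(4.0, 4.0)
print(z, t, f)
```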

Analysis of Variance (ANOVA)

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square       F             P
Between Groups        SSB              dfB                  MSB = SSB/dfB     F = MSB/MSW   P(F)
Within Groups         SSW              dfW                  MSW = SSW/dfW
Total                 SST              dfT
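
The entries of the ANOVA table can be computed as in the following sketch of a one-way ANOVA. The function name and the example groups are hypothetical. The P column is omitted because it needs the F distribution, which Python's standard library does not provide.

```python
def one_way_anova(groups):
    """Return the one-way ANOVA table entries for a list of groups."""
    all_values = [v for g in groups for v in g]
    n = len(all_values)
    grand_mean = sum(all_values) / n
    means = [sum(g) / len(g) for g in groups]
    # Between-group and within-group sums of squares.
    ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ssw = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_b, df_w = len(groups) - 1, n - len(groups)
    msb, msw = ssb / df_b, ssw / df_w
    return {
        "SSB": ssb, "SSW": ssw, "SST": ssb + ssw,
        "dfB": df_b, "dfW": df_w, "dfT": n - 1,
        "MSB": msb, "MSW": msw, "F": msb / msw,
    }

groups = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]  # hypothetical data
anova = one_way_anova(groups)
print(anova)
```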

Categorical & Categorical

                 Default Y   Default N
Corporation Y    366         2786
Corporation N    191         4777

Is the rate of default different between the two types of businesses?

Categorical & Categorical

                 Default Y   Default N
Corporation Y    4.5%        34.3%
Corporation N    2.4%        58.8%

Categorical & Categorical

[Bar chart: percentage of records by Default (Y, N) and Corporation (Y, N); vertical axis 0% to 60%]

Categorical & Categorical

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(n_{ij} - e_{ij})^2}{e_{ij}}$$

$$e_{ij} = \frac{n_{i.} \, n_{.j}}{n}$$

$$df = (r - 1)(c - 1)$$
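
Applied to the Corporation by Default counts from the earlier slide, the chi-square statistic can be computed as sketched below. The function name is illustrative; the observed counts are the ones given in the contingency table.

```python
def chi_square(table):
    """Chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # e_ij = n_i. * n_.j / n
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

# Rows: Corporation Y, N. Columns: Default Y, N.
observed = [[366, 2786], [191, 4777]]
chi2, df = chi_square(observed)
print(chi2, df)
```

With one degree of freedom the 5% critical value is about 3.84, so a statistic well above that suggests the default rate differs between the two types of businesses.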

Data Exploration - MVP
[Combination chart: Months in Business and Default; plotted series: Default%]

Summary
• Data exploration covers all activities needed to become familiar with the data, identify data quality problems, and discover first insights into the data.
• Univariate analysis can show a variable's distribution, missing values, invalid values, and outliers.
• Bivariate analysis can discover relationships between variables.
• The combination chart (variable & target) is the most valuable type of plot.
