6-Significance of Exploratory Data Analysis, Making Sense of Data-06!02!2024
(ECE3502)
Module-2
Tags to organize data
Tag data to pre-process large datasets
Application of predictive models for forecasting
Data pre-processing – “an important milestone
of the Data Mining Process”
✓ Data mining is a process of discovering patterns in large data
sets involving methods at the intersection of machine learning, statistics,
and database systems.
✓ Data mining is an interdisciplinary subfield of computer
science and statistics with an overall goal to extract information (with
intelligent methods) from a data set and transform the information into a
comprehensible structure for further use.
Data analysis pipeline
Mining is not the only step in the analysis process
Data → Preprocessing → Data Mining → Post-processing → Result
Sampling
Dimensionality Reduction
Feature subset selection
Distance/Similarity Calculation
Visualization
Data Preparation as a step in the Knowledge
Discovery Process
DB → Cleaning and Integration → DW → Selection and Transformation → Data Mining → Evaluation and Presentation → Knowledge
Why Prepare Data?
• Data need to be formatted for a given software tool
• Data need to be made adequate for a given method
• Data in the real world is dirty
• incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
• e.g., occupation=“”
• noisy: containing errors or outliers
• e.g., Salary=“-10”, Age=“222”
• inconsistent: containing discrepancies in codes or names
• e.g., Age=“42” Birthday=“03/07/1997”
• e.g., Was rating “1,2,3”, now rating “A, B, C”
• e.g., discrepancy between duplicate records
• e.g., Address: travessa da Igreja de Nevogilde, Parish: Paranhos
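The three kinds of dirty data listed above (incomplete, noisy, inconsistent) can be checked mechanically. The following is a minimal sketch, not a definitive validator: the function name `find_issues`, the field names, the plausible-age bound of 120, the reference date, and the reading of "03/07/1997" as 7 March 1997 are all assumptions made for illustration.

```python
from datetime import date

def find_issues(record, today=date(2024, 2, 6)):
    """Return a list of data-quality issues found in one record (a dict)."""
    issues = []
    # Incomplete: a missing or empty attribute value.
    if not record.get("occupation"):
        issues.append("incomplete: occupation is missing")
    # Noisy: values outside a plausible range (the bounds are assumptions).
    if record.get("salary", 0) < 0:
        issues.append("noisy: negative salary")
    if not 0 <= record.get("age", 0) <= 120:
        issues.append("noisy: implausible age")
    # Inconsistent: the stated age disagrees with the age derived from the birthday.
    birthday = record.get("birthday")
    if birthday is not None and "age" in record:
        had_birthday = (today.month, today.day) >= (birthday.month, birthday.day)
        derived_age = today.year - birthday.year - (0 if had_birthday else 1)
        if derived_age != record["age"]:
            issues.append("inconsistent: age does not match birthday")
    return issues

# Hypothetical record combining the slide's examples.
record = {"occupation": "", "salary": -10, "age": 42, "birthday": date(1997, 3, 7)}
issues = find_issues(record)
for issue in issues:
    print(issue)
```

Each check only flags a record; deciding whether to correct, drop, or keep the value is a separate step.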
Attribute Values
Scales, in order of increasing information content:
• Categorical scale (qualitative)
• Ordinal scale (qualitative)
• Interval scale (quantitative)
• Ratio scale (quantitative)
Discrete or Continuous
Discrete and Continuous Attributes
Discrete Attribute
Has only a finite or countably infinite set of values
Examples: zip codes, counts, or the set of words in a
collection of documents
Often represented as integer variables.
Continuous Attribute
Has real numbers as attribute values
Examples: temperature, height, or weight.
Practically, real values can only be measured and represented
using a finite number of digits.
Data Quality
• Approaches:
• do nothing
• enforce upper and lower bounds
• let binning handle the problem
Outlier detection
• Univariate
• Compute the mean $\bar{x}$ and standard deviation $s$. For k = 2 or 3, x is an outlier if it lies outside the limits
$(\bar{x} - ks, \bar{x} + ks)$
(normal distribution assumed)
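The univariate rule above can be sketched in a few lines. This is an illustrative sketch only: the function name `univariate_outliers` and the sample ages are made up, and the sample standard deviation is used as an assumption (the slide does not say which estimator is meant).

```python
from statistics import mean, stdev

def univariate_outliers(values, k=2):
    """Return the values lying outside (mean - k*s, mean + k*s)."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (assumption)
    lower, upper = m - k * s, m + k * s
    return [x for x in values if x < lower or x > upper]

# Hypothetical ages with one obviously wrong entry.
ages = [23, 25, 27, 29, 31, 33, 35, 37, 39, 222]
outliers = univariate_outliers(ages, k=2)
print(outliers)
```

On this small sample the value 222 is flagged with k = 2 but not with k = 3, because the outlier itself inflates the standard deviation. This is one reason the choice of k matters on small data sets.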
Outlier detection
• Multivariate
• Clustering
• Very small clusters are outliers
Recommended reading
Data cleaning
Process of dealing with duplicate data issues
Example: the same person with multiple email addresses
Data Quality: Handle Noise (Binning)
Binning
sort data and partition into (equi-depth) bins
smooth by bin means, bin median, bin boundaries, etc.
Regression
smooth by fitting a regression function
Clustering
detect and remove outliers
Combined computer and human inspection
detect suspicious values automatically and check by human
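Smoothing by binning, as described above, can be sketched as follows. This is a minimal sketch under stated assumptions: the helper names and the sample values are hypothetical, and the number of bins is assumed not to exceed the number of values.

```python
def equal_depth_bins(values, n_bins):
    """Sort the data and split it into n_bins bins of (almost) equal size."""
    data = sorted(values)
    base, extra = divmod(len(data), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + base + (1 if i < extra else 0)
        bins.append(data[start:end])
        start = end
    return bins

def smooth_by_means(bins):
    """Replace every value by the mean of its bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's two boundaries."""
    return [[b[0] if x - b[0] <= b[-1] - x else b[-1] for x in b] for b in bins]

values = [21, 4, 28, 9, 34, 15, 24, 8, 25, 26, 21, 29]  # made-up data
bins = equal_depth_bins(values, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
print(bins)
print(by_means)
print(by_boundaries)
```

Smoothing by bin median works the same way, with the median in place of the mean.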
Data Quality: Handle Noise (Binning)
Equal-width binning
Divides the range into N intervals of equal size
Width of intervals: $W = (B - A)/N$, where A and B are the lowest and highest values of the attribute
Simple
Equal-depth binning
Divides the range into N intervals,
each containing approximately same number of records
Skewed data is also handled well
Simple Methods: Binning
Example: customer ages (number of values per bin)
Equi-width binning: 0-10, 10-20, 20-30, 30-40, 40-50, 50-60, 60-70, 70-80
Equi-depth binning: 0-22, 22-31, 32-38, 38-44, 44-48, 48-55, 55-62, 62-80
Simple Discretization Methods:
Binning
Equal-width (distance) partitioning:
Divides the range into N intervals of equal size: uniform grid
If A and B are the lowest and highest values of the attribute, the width
of the intervals will be: $W = (B - A)/N$.
The most straightforward, but outliers may dominate presentation
Skewed data is not handled well.
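Equal-width partitioning with W = (B − A)/N can be sketched as below. The function name `equal_width_counts` and the data are hypothetical, and the sketch assumes the attribute has at least two distinct values so that W > 0.

```python
def equal_width_counts(values, n_bins):
    """Partition [A, B] into n_bins intervals of width W and count values per bin."""
    a, b = min(values), max(values)
    w = (b - a) / n_bins  # W = (B - A) / N
    counts = [0] * n_bins
    for x in values:
        # The maximum value falls on the last edge, so clamp it into the last bin.
        index = min(int((x - a) / w), n_bins - 1)
        counts[index] += 1
    return w, counts

# Made-up skewed data: one large value stretches the range.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
width, counts = equal_width_counts(values, 4)
print(width, counts)
```

With this data a single large value sets the range, so nine of the ten values land in the first bin and two bins stay empty. That is the sense in which outliers may dominate and skewed data is not handled well.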
Centralization:
Based on fitting a distribution to the data
Distance function between distributions
◼ KL (Kullback-Leibler) Distance
◼ Mean Centering
Data Transformation: Normalization
min-max normalization
$v' = \frac{v - min}{max - min} (new\_max - new\_min) + new\_min$
z-score normalization
$v' = \frac{v - mean}{stand\_dev}$
normalization by decimal scaling
$v' = \frac{v}{10^j}$, where j is the smallest integer such that $\max(|v'|) < 1$
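The three normalization formulas above can be sketched as follows. This is an illustration under stated assumptions: the function names and sample values are made up, the population standard deviation is used for the z-score (the slide does not specify which one), and min-max assumes max > min.

```python
from statistics import mean, pstdev

def min_max(values, new_min=0.0, new_max=1.0):
    """v' = (v - min) / (max - min) * (new_max - new_min) + new_min"""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """v' = (v - mean) / stand_dev, using the population standard deviation."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def decimal_scaling(values):
    """v' = v / 10**j, with j the smallest integer such that max(|v'|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

values = [200, 300, 400, 600, 1000]  # made-up data
scaled_01 = min_max(values)
standardized = z_score(values)
scaled_dec, j = decimal_scaling(values)
print(scaled_01)
print(standardized)
print(scaled_dec, j)
```

Note that for a maximum of exactly 1000 decimal scaling needs j = 4, since j = 3 would give a value equal to 1 rather than strictly below it.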
Example: Data Normalization
Data Transformation: Aggregation
Combining two or more attributes (or objects) into a
single attribute (or object)
Purpose
Data reduction
◼ Reduce the number of attributes or objects
Change of scale
◼ Cities aggregated into regions, states, countries, etc
More “stable” data
◼ Aggregated data tends to have less variability
Data Transformation: Discretization
Motivation for Discretization
Methods
• Binning (as explained earlier)
• Cluster analysis
[Figure: Big Data and Sampled Data]
Data Sampling
Statisticians sample because obtaining the entire set of data of
interest is too expensive or time consuming.
Example: What is the average height of a person in Ioannina?
We cannot measure the height of everybody
Using a sample will work almost as well as using the entire data
sets, if the sample is representative
Stratified sampling
Split the data into several partitions; then draw random samples from each
partition
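Stratified sampling as described above can be sketched like this. The function name `stratified_sample`, the proportional allocation per partition, and the fixed seed are assumptions for illustration; the slide only says that random samples are drawn from each partition.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, fraction, seed=0):
    """Split records into partitions by key(record), then sample each partition."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for group in strata.values():
        # Proportional allocation (assumption), with at least one item per partition.
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample

# Hypothetical population: 51 women and 49 men.
population = [("W", i) for i in range(51)] + [("M", i) for i in range(49)]
sample = stratified_sample(population, key=lambda r: r[0], fraction=0.1)
print(len(sample), sum(1 for r in sample if r[0] == "W"))
```

Because each partition is sampled separately, every group is guaranteed to appear in the sample, which simple random sampling does not guarantee for small groups.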
Types of Sampling
Simple Random Sampling
There is an equal probability of selecting any particular item
Sampling without replacement
As each item is selected, it is removed from the population
Sampling with replacement
Objects are not removed from the population as they are selected for the
sample.
◼ In sampling with replacement, the same object can be picked up more than once. This makes
analytical computation of probabilities easier
◼ E.g., we have 100 people, 51 are women P(W) = 0.51, 49 men
P(M) = 0.49. If I pick two persons what is the probability P(W,W)
that both are women?
◼ Sampling with replacement: $P(W,W) = 0.51^2$
◼ Sampling without replacement: P(W,W) = 51/100 * 50/99
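The two probabilities in the example above can be checked with exact fractions, and the two sampling modes correspond to two standard-library calls. This is a small sketch; the variable names and the fixed seed are assumptions for illustration.

```python
import random
from fractions import Fraction

women, total = 51, 100

# With replacement: the two draws are independent.
p_with = Fraction(women, total) ** 2
# Without replacement: the second draw sees one woman and one person fewer.
p_without = Fraction(women, total) * Fraction(women - 1, total - 1)

print(float(p_with))     # 0.2601
print(float(p_without))  # about 0.2576

# Drawing two persons in each mode.
rng = random.Random(0)
people = ["W"] * 51 + ["M"] * 49
pair_with = rng.choices(people, k=2)    # same person may be picked twice
pair_without = rng.sample(people, k=2)  # a person is picked at most once
```

The two results differ only slightly here because the sample is small relative to the population.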
Sample Size
Dimensionality Reduction
Objectives:
Avoid curse of dimensionality
Reduce amount of time and memory required by data mining algorithms
Allow data to be more easily visualized
Observation: Certain dimensions are correlated
Techniques
Similarity and Dissimilarity
Similarity
Numerical measure of how alike two data objects are.
Is higher when objects are more alike.
Often falls in the range [0,1]
Dissimilarity
Numerical measure of how different two data objects are
Lower when objects are more alike
Minimum dissimilarity is often 0
Upper limit varies
$dist = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}$
$\sqrt{(y_1 - x_1)^2 + (y_2 - x_2)^2 + \dots + (y_n - x_n)^2}$
David Corne and Nick Taylor, Heriot-Watt University. These slides and related resources:
http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
Euclidean Distance
[Figure: scatter plot of the points p1 to p4, x axis 0 to 6, y axis 0 to 3]
point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1
Distance Matrix
      p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0
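The distance matrix for the four points p1 to p4 can be reproduced with a short sketch. The function name `euclidean` and the rounding to three decimals are choices made here for illustration.

```python
import math

def euclidean(p, q):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))

points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}
names = list(points)

# Pairwise distances, rounded to three decimals as in the table.
matrix = [[round(euclidean(points[a], points[b]), 3) for b in names] for a in names]

for name, row in zip(names, matrix):
    print(name, row)
```

The matrix is symmetric with zeros on the diagonal, so only the upper or lower triangle needs to be computed in practice.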
Sample Correlation Matrix
[Figure: correlation matrix with a color scale from -1 to +1. Data on characteristics of Boston suburbs: business acreage, nitrous oxide, average # rooms, median house value, percentage of large residential lots]
Summary
Regression Classification
Regression
[Figure: Input Data → ML Regression]
Image sources: http://www.bom.gov.au/watl/about/about-latest-weather-graphs.shtml, https://walletinvestor.com/forex-forecast/eur-usd-prediction
Classification
[Figure: Input Data (e.g. e-mail text) → Machine Learning Classification (ML Classifier) → Prediction: Category]
Image source: https://towardsdatascience.com/applied-text-classification-on-email-spam-filtering-part-1-1861e1a83246
Regression vs Classification
[Figure: Input Data → ML Classification → {cheap, affordable, expensive}]
Applications of Predictive Modelling
Analytical customer relationship management (CRM)
Health Care
Collection Analytics
Cross-sell
Fraud detection
Risk management
❖ Industry Applications
Predictive modelling is used in insurance, banking, marketing,
financial services, telecommunications, retail, travel, healthcare, oil
& gas and other industries.
Predictive Models in Retail industry
• Campaign Response Model – this model predicts the
likelihood that a customer responds to a specific campaign by
purchasing a product solicited in the campaign. The model also
predicts the amount of the purchase, given a response.
➢ Regression models
➢ Customer Segmentation
➢ Cross-Sell and Upsell
➢ New Product Recommendation
➢ Customer Retention/Loyalty/Churn
➢ Inventory Management
• Will this customer move their business to a
different company?
SAS Analytics
STATISTICA
IBM Predictive Analytics
MATLAB
Minitab