
IoT Domain Analyst

(ECE3502)
Module-2

Dr. Biswajit Dwivedy

School of Electronics Engineering


VIT University, Vellore, India
Module 2

 Tags to organize data
 Tag data to pre-process large datasets
 Predictive models for forecasting
 Application of predictive models
Data pre-processing – “an important milestone
of the Data Mining Process”
✓ Data mining is a process of discovering patterns in large data
sets involving methods at the intersection of machine learning, statistics,
and database systems.
✓ Data mining is an interdisciplinary subfield of computer
science and statistics with an overall goal to extract information (with
intelligent methods) from a data set and transform the information into a
comprehensible structure for further use.
Data analysis pipeline
 Mining is not the only step in the analysis process

Data → Preprocessing → Data Mining → Post-processing → Result

 Preprocessing: real data is noisy, incomplete and inconsistent.


Data cleaning is required to make sense of the data
◼ Techniques: Sampling, Dimensionality Reduction, Feature
Selection.
 Post-Processing: Make the data actionable and useful to the
user: statistical analysis of importance & visualization.
Major Tasks in Data Preparation
• Data discretization
• Part of data reduction but with particular importance, especially for numerical data
• Data cleaning
• Fill in missing values, smooth noisy data, identify or remove outliers, and resolve
inconsistencies
• Data integration
• Integration of multiple databases, data cubes, or files
• Data transformation
• Normalization and aggregation
• Data reduction
• Obtains reduced representation in volume but produces the same or similar analytical
results
Data Preprocessing
 Attribute Values
 Attribute Transformation
 Normalization (Standardization)
 Aggregation
 Discretization

 Sampling
 Dimensionality Reduction
 Feature subset selection
 Distance/Similarity Calculation
 Visualization
Data Preparation as a step in the Knowledge
Discovery Process
DB / DW → Cleaning and Integration → Selection and Transformation → Data Mining → Evaluation and Presentation → Knowledge
Why Prepare Data?
• Data need to be formatted for a given software tool
• Data need to be made adequate for a given method
• Data in the real world is dirty
• incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
• e.g., occupation=“”
• noisy: containing errors or outliers
• e.g., Salary=“-10”, Age=“222”
• inconsistent: containing discrepancies in codes or names
• e.g., Age=“42” Birthday=“03/07/1997”
• e.g., Was rating “1,2,3”, now rating “A, B, C”
• e.g., discrepancy between duplicate records
• e.g., Endereço (address): travessa da Igreja de Nevogilde; Freguesia (parish): Paranhos (street name and recorded parish are inconsistent)
Attribute Values

 Data is described using attribute values
 Attribute values are numbers or symbols assigned to an
attribute
 Distinction between attributes and attribute values
 Same attribute can be mapped to different attribute values
◼ Example: height can be measured in feet or meters

 Different attributes can be mapped to the same set of


values
◼ Example: Attribute values for ID and age are integers
◼ But properties of attribute values can be different
◼ ID has no limit but age has a maximum and minimum value
Types of Attributes
 There are different types of attributes
 Nominal
◼ Examples: ID numbers, eye color, zip codes
 Ordinal
◼ Examples: rankings (e.g., taste of potato chips on a scale from
1-10), grades, height in {tall, medium, short}
 Interval
◼ Examples: calendar dates
 Ratio
◼ Examples: length, time, counts
Scales of measurement, in order of increasing information content:
• Nominal scale (qualitative)
• Categorical scale (qualitative)
• Ordinal scale (qualitative)
• Interval scale (quantitative)
• Ratio scale (quantitative)
Attribute values on any scale may be discrete or continuous.
Discrete and Continuous Attributes
 Discrete Attribute
 Has only a finite or countably infinite set of values
 Examples: zip codes, counts, or the set of words in a
collection of documents
 Often represented as integer variables.

 Continuous Attribute
 Has real numbers as attribute values
 Examples: temperature, height, or weight.
 Practically, real values can only be measured and represented
using a finite number of digits.
Data Quality

Data has attribute values

Then,

How good is our data with respect to these attribute values?


Data Quality
 Examples of data quality problems:
  Noise and outliers
  Missing values
  Duplicate data

Tid | Refund | Marital Status | Taxable Income | Cheat
1   | Yes    | Single         | 125K           | No
2   | No     | Married        | 100K           | No
3   | No     | Single         | 70K            | No
4   | Yes    | Married        | 120K           | No
5   | No     | Divorced       | 10000K         | Yes    ← a mistake or a millionaire?
6   | No     | NULL           | 60K            | No     ← missing value
7   | Yes    | Divorced       | 220K           | NULL   ← missing value
8   | No     | Single         | 85K            | Yes
9   | No     | Married        | 90K            | No
9   | No     | Single         | 90K            | No     ← inconsistent duplicate entries
Data Quality: Noise
 Noise refers to modification of original values
 Examples: distortion of a person’s voice when talking on
a poor phone and “snow” on television screen

Figure: two sine waves (left) and the same two sine waves with added noise (right)


Data Quality: Outliers
 Outliers are data objects with characteristics that are
considerably different than most of the other data objects in
the data set
Outliers
• Outliers are values thought to be out of range.
• “An outlier is an observation that deviates so much from other
observations as to arouse suspicion that it was generated by a
different mechanism”
• Can be detected by standardizing observations and labelling the standardized values
outside a predetermined bound as outliers
• Outlier detection can be used for fraud detection or data cleaning

• Approaches:
• do nothing
• enforce upper and lower bounds
• let binning handle the problem
Outlier detection
• Univariate
• Compute the mean x̄ and standard deviation s. For k = 2 or 3 (normal distribution assumed), x is an outlier if it falls outside the interval (x̄ − k·s, x̄ + k·s), as in the sketch below.
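A minimal Python sketch of this rule, assuming a plain NumPy array of values (the array contents and the choice of k are only illustrative):

import numpy as np

def univariate_outliers(values, k=3):
    # Flag values outside (mean - k*std, mean + k*std)
    values = np.asarray(values, dtype=float)
    mean, std = values.mean(), values.std()
    return (values < mean - k * std) | (values > mean + k * std)

ages = np.array([23, 25, 27, 22, 24, 26, 222])   # 222 looks like a data-entry error
print(univariate_outliers(ages, k=2))            # only the last entry is flagged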
Outlier detection
• Multivariate

• Clustering
• Very small clusters are outliers

Recommended reading

Only with hard work and a favorable context will you have the chance to become an outlier!
Data Quality: Missing
Values
 Reasons for missing values
 Information is not collected
(e.g., people decline to give their age and weight)
 Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)

 Handling missing values


 Eliminate Data Objects
 Estimate Missing Values
 Ignore the Missing Value During Analysis
 Replace with all possible values (weighted by their
probabilities)
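As a hedged illustration of the "Estimate Missing Values" option, the short pandas sketch below drops incomplete records and, alternatively, fills numeric gaps with the column mean (column names and values are invented for the example):

import pandas as pd

df = pd.DataFrame({
    "age":    [25, None, 42, 37, None],
    "income": [50000, 62000, None, 48000, 55000],
})

dropped = df.dropna()             # eliminate data objects with missing values
imputed = df.fillna(df.mean())    # estimate: replace each NaN with the column mean
print(imputed)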
Data Quality: Duplicate Data

 Data set may include data objects that are duplicates,


or almost duplicates of one another
 Major issue when merging data from heterogeneous sources

 Examples:
 Same person with multiple email addresses

 Data cleaning
 Process of dealing with duplicate data issues
Data Quality: Handle
Noise(Binning)
 Binning
 sort data and partition into (equi-depth) bins
 smooth by bin means, bin median, bin boundaries, etc.

 Regression
 smooth by fitting a regression function
 Clustering
 detect and remove outliers
 Combined computer and human inspection
 detect suspicious values automatically and check by human
Data Quality: Handle
Noise(Binning)
 Equal-width binning
 Divides the range into N intervals of equal size
 Width of intervals: W = (B − A)/N, where A and B are the lowest and highest values of the attribute

 Simple

 Outliers may dominate result

 Equal-depth binning
 Divides the range into N intervals,
each containing approximately same number of records
 Skewed data is also handled well
Simple Methods: Binning
Example: customer ages (the original figure plots the number of values falling in each bin)

Equi-width binning: 0-10, 10-20, 20-30, 30-40, 40-50, 50-60, 60-70, 70-80

Equi-depth binning: 0-22, 22-31, 32-38, 38-44, 44-48, 48-55, 55-62, 62-80
Simple Discretization Methods:
Binning
 Equal-width (distance) partitioning:
 Divides the range into N intervals of equal size: uniform grid
 if A and B are the lowest and highest values of the attribute, the width
of intervals will be: W = (B –A)/N.
 The most straightforward, but outliers may dominate presentation
 Skewed data is not handled well.

 Equal-depth (frequency) partitioning:


 Divides the range into N intervals, each containing approximately same
number of samples
 Good data scaling
 Managing categorical attributes can be tricky.
Data Quality: Handle
Noise(Binning)
Example: Sorted price values 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29,
34
* Partition into three (equi-depth) bins
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
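A minimal NumPy sketch reproducing the equi-depth partition and the smoothing by bin means above (variable names are illustrative):

import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equi-depth partition: three bins with four values each
bins = np.array_split(np.sort(prices), 3)

# Smoothing by bin means: every value in a bin is replaced by its bin mean
smoothed = np.concatenate([np.full(len(b), int(round(b.mean()))) for b in bins])
print(smoothed)   # [ 9  9  9  9 23 23 23 23 29 29 29 29]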
Data Quality: Handle Noise (Regression)
• Replace noisy or missing values by predicted values
• Requires a model of attribute dependencies (maybe wrong!)
• Can be used for data smoothing or for handling missing data

Figure: a regression line y = x + 1 fitted to the data; the noisy value Y1 observed at X1 is replaced by the predicted value Y1'
Data Quality
There are many more noise handling techniques, e.g., imputation …
Data
Transformation
Data has attribute values

Then,

Can we compare these attribute values?

For example: compare the following two pairs of records


(1) (5.9 ft, 50 Kg)
(2) (4.6 ft, 55 Kg)
Vs.
(3) (5.9 ft, 50 Kg)
(4) (5.6 ft, 56 Kg)

We need Data Transformation to make records with different
dimensions (attributes) comparable …
Data Transformation
Techniques
 Normalization: scaled to fall within a small, specified range.
 min-max normalization
 z-score normalization .... Z-scores are a way to compare results to a “normal”
population.
 normalization by decimal scaling

 Centralization:
 Based on fitting a distribution to the data
 Distance function between distributions
◼ KL (Kullback-Leibler) Distance
◼ Mean Centering
Data Transformation:
Normalization
 min-max normalization
   v' = ((v − min) / (max − min)) · (new_max − new_min) + new_min
 z-score normalization
   v' = (v − mean) / stand_dev
 normalization by decimal scaling
   v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1
Example: Data Normalization
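The original slide's worked example is not reproduced here; the sketch below applies the three formulas above to a small made-up array so they can be checked by hand:

import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization into the new range [0, 1]
new_min, new_max = 0.0, 1.0
minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# Z-score normalization
zscore = (v - v.mean()) / v.std()

# Decimal scaling: divide by 10^j so that max(|v'|) < 1
j = int(np.ceil(np.log10(np.abs(v).max() + 1)))
decimal = v / 10 ** j

print(minmax, zscore, decimal, sep="\n")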
Data Transformation: Aggregation
 Combining two or more attributes (or objects) into a
single attribute (or object)

 Purpose
 Data reduction
◼ Reduce the number of attributes or objects
 Change of scale
◼ Cities aggregated into regions, states, countries, etc
 More “stable” data
◼ Aggregated data tends to have less variability
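As an illustration of the "change of scale" point, a small hypothetical pandas aggregation of city-level records into state-level totals (all names and numbers are made up):

import pandas as pd

sales = pd.DataFrame({
    "state": ["TN", "TN", "KA", "KA"],
    "city":  ["Vellore", "Chennai", "Bengaluru", "Mysuru"],
    "units": [120, 340, 410, 90],
})

# Cities aggregated into states: fewer objects and less variability
by_state = sales.groupby("state")["units"].sum()
print(by_state)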
Data Transformation:
Discretization
 Motivation for Discretization

 Some data mining algorithms only accept categorical


attributes

 May improve understandability of patterns


Data Transformation:
Discretization
 Task
 Reduce the number of values for a given continuous attribute
by partitioning the range of the attribute into intervals
 Interval labels replace actual attribute values

 Methods
• Binning (as explained earlier)

• Cluster analysis

• Entropy-based Discretization (Supervised)


Data Sampling
Data may be Big

Then,

Can we make it small by selecting some part of it?

Data Sampling can do this…

“Sampling is the main technique employed for data selection.”


Data Sampling

Figure: a large data set (Big Data) reduced to a smaller Sampled Data set
Data Sampling
 Statisticians sample because obtaining the entire set of data of
interest is too expensive or time consuming.
 Example: What is the average height of a person in Ioannina?
 We cannot measure the height of everybody

 Sampling is used in data mining because processing the entire set of


data of interest is too expensive or time consuming.
 Example: We have 1M documents. What fraction has at least 100 words in
common?
 Computing number of common words for all pairs requires
10^12 comparisons
Data Sampling …
 The key principle for effective sampling is the following:

 Using a sample will work almost as well as using the entire data
sets, if the sample is representative

 A sample is representative if it has approximately the same


property (of interest) as the original set of data

 Otherwise we say that the sample introduces some bias

 What happens if we take a sample from the university campus to


compute the average height of a person at Vellore?
Types of Sampling
 Simple Random Sampling
 There is an equal probability of selecting any particular item

 Sampling without replacement


 As each item is selected, it is removed from the population

 Sampling with replacement


 Objects are not removed from the population as they are selected for the sample.
◼ In sampling with replacement, the same object can be picked up more than once

 Stratified sampling
 Split the data into several partitions; then draw random samples from each
partition
Types of Sampling
 Simple Random Sampling
 There is an equal probability of selecting any particular item
 Sampling without replacement
 As each item is selected, it is removed from the population
 Sampling with replacement
 Objects are not removed from the population as they are selected for the
sample.
◼ In sampling with replacement, the same object can be picked up more than once. This makes
analytical computation of probabilities easier
◼ E.g., we have 100 people, 51 are women P(W) = 0.51, 49 men
P(M) = 0.49. If I pick two persons what is the probability P(W,W)
that both are women?
◼ Sampling with replacement: P(W,W) = 0.51²
◼ Sampling without replacement: P(W,W) = 51/100 × 50/99
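A quick Monte Carlo check of these two probabilities, sketched with NumPy (population size matches the example; the trial count is arbitrary):

import numpy as np

rng = np.random.default_rng(0)
population = np.array(["W"] * 51 + ["M"] * 49)

def both_women(replace, trials=50_000):
    hits = 0
    for _ in range(trials):
        pair = rng.choice(population, size=2, replace=replace)
        hits += (pair == "W").all()
    return hits / trials

print("with replacement:   ", both_women(True))    # close to 0.51**2 = 0.2601
print("without replacement:", both_women(False))   # close to 51/100 * 50/99 ≈ 0.2576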
Sample Size

Figure: the same two-dimensional data set sampled at 8000, 2000 and 500 points


Example: Data Collection
Data Collection → Data → Preprocessing → Data Mining → Post-processing → Result

 Today there is an abundance of data online


 Facebook, Twitter, Wikipedia, Web, etc…
 We can extract interesting information from this data, but first we
need to collect it
 Customized crawlers, use of public APIs
 Additional cleaning/processing to parse out the useful parts
 Respect of crawling etiquette
Dimensionality Reduction
Each record has many attributes
◼ useful, useless or correlated

Then,

Can we select some small subset of attributes?

Dimensionality Reduction can do this….


Dimensionality Reduction
 Why?
 When dimensionality increases, data becomes increasingly
sparse in the space that it occupies

 Curse of Dimensionality : Definitions of density and distance between points,


which is critical for clustering and outlier detection, become less meaningful

 Objectives:
 Avoid curse of dimensionality
 Reduce amount of time and memory required by data mining
algorithms
 Observation: Certain Dimensions are correlated
Dimensionality
Reduction
 Allow data to be more easily visualized

 May help to eliminate irrelevant features or reduce


noise

 Techniques

 Principal Component Analysis or Singular Value Decomposition


 (Mapping Data to New Space) : Wavelet Transform
 Others: supervised and non-linear techniques
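A minimal PCA sketch using NumPy's singular value decomposition; the synthetic data and the choice of two retained components are only illustrative:

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))                      # 100 records, 5 attributes
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=100)     # make one attribute nearly redundant

Xc = X - X.mean(axis=0)                            # mean-center the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)  # principal directions are rows of Vt

k = 2                                              # keep the top-2 principal components
X_reduced = Xc @ Vt[:k].T
explained = (S**2)[:k].sum() / (S**2).sum()
print(X_reduced.shape, f"variance retained: {explained:.2f}")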
Distance/Similarity
Data has many records

Then,

Can we find similar records?

Distance and Similarity are commonly used….


What is similar? Objects may be alike in shape, colour, size, or pattern.
Similarity and Dissimilarity
 Similarity
 Numerical measure of how alike two data objects are.
 Is higher when objects are more alike.
 Often falls in the range [0,1]

 Dissimilarity
 Numerical measure of how different are two data objects
 Lower when objects are more alike
 Minimum dissimilarity is often 0
 Upper limit varies

 Proximity refers to a similarity or dissimilarity


Euclidean Distance
 Euclidean Distance

dist = sqrt( Σ_{k=1}^{n} (p_k − q_k)² )

 Where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.

 Standardization is necessary if scales differ.


Euclidean Distance (Metric)
Euclidean distance:

 Point 1 is: (x_1, x_2, ..., x_n)
 Point 2 is: (y_1, y_2, ..., y_n)

Euclidean distance is:
   sqrt( (y_1 − x_1)² + (y_2 − x_2)² + ... + (y_n − x_n)² )

David Corne and Nick Taylor, Heriot-Watt University - [email protected] These slides and related resources:
http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
Euclidean Distance

point | x | y
p1    | 0 | 2
p2    | 2 | 0
p3    | 3 | 1
p4    | 5 | 1

Distance Matrix:
      p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0
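The distance matrix above can be reproduced with a short NumPy sketch (point coordinates copied from the table):

import numpy as np

points = np.array([[0, 2],    # p1
                   [2, 0],    # p2
                   [3, 1],    # p3
                   [5, 1]])   # p4

# Pairwise Euclidean distances via broadcasting
diff = points[:, None, :] - points[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))
print(np.round(dist, 3))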
Sample Correlation Matrix

Figure: sample correlation matrix (colour scale from −1 through 0 to +1) for data on characteristics of Boston suburbs: business acreage, nitrous oxide, average number of rooms, median house value, and percentage of large residential lots.
Summary

• Every real world data set needs some kind of data


pre-processing
• Deal with missing values
• Correct erroneous values
• Select relevant attributes
• Adapt data set format to the software tool to be used

• In general, data pre-processing consumes more


than 60% of a data mining project effort
Information / Entropy
 Given probabilities p1, p2, ..., ps whose sum is 1, Entropy is defined as:

   H = − Σ_{i=1}^{s} p_i · log2(p_i)
 Entropy measures the amount of randomness or surprise or


uncertainty.

 Only takes into account non-zero probabilities


Entropy-Based
Discretization
 Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is

   E(S, T) = (|S1| / |S|) · Ent(S1) + (|S2| / |S|) · Ent(S2)

 The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization.
 The process is recursively applied to the partitions obtained until some stopping criterion is met, e.g., the information gain falls below a threshold δ:

   Ent(S) − E(T, S) < δ
 Experiments show that it may reduce data size and improve
classification accuracy
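A rough Python sketch of one level of this procedure, assuming a numeric attribute with class labels (function and variable names are illustrative, not from the slides):

import numpy as np

def entropy(labels):
    # Shannon entropy of a label array (zero-probability classes drop out)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def best_split(values, labels):
    # Return the boundary T that minimizes E(S, T) for a single binary split
    order = np.argsort(values)
    values, labels = values[order], labels[order]
    best_t, best_e = None, np.inf
    for i in range(1, len(values)):
        t = (values[i - 1] + values[i]) / 2
        left, right = labels[:i], labels[i:]
        e = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if e < best_e:
            best_t, best_e = t, e
    return best_t, best_e

x = np.array([4, 8, 15, 21, 25, 29, 34], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1, 1])
print(best_split(x, y))   # boundary 18.0, where the two classes separate cleanly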
Predictive Data Analytics Models and Applications
Data Analytics

• Data analytics is the science of analysing raw data in


order to make conclusions about that information.
Predictive Modelling

Geisser (1993) defines predictive modelling as “the process by which a model is created or chosen to try to best predict the probability of an outcome.”
What is Predictive Modelling
• Predictive analytics is the branch of the advanced analytics which is
used to make predictions about unknown future events.
• Predictive analytics uses many techniques from data mining, statistics,
modeling, machine learning, and artificial intelligence to analyze
current data and make predictions about the future.

 Predictive modeling is a process used in predictive analytics to


create a statistical model of future behavior.

 Predictive analytics is the area of data mining concerned with


forecasting probabilities and trends.
Predictive Analytics Process
Business process and features on Predictive
Modelling

 Business process on Predicting modelling


❖ Creating the model
❖ Testing the model
❖ Validating the model
❖ Evaluating the model

 Features in Predicting modelling


❖ Data analysis and manipulation
❖ Visualization
❖ Statistics
❖ Hypothesis testing
How the model works
✓ In predictive modeling, data is collected for the relevant
predictors, a statistical model is formulated, predictions are made
and the model is validated (or revised) as additional data
becomes available.
✓ The model may employ a simple linear equation or a complex
neural network, mapped out by sophisticated software.
How the model works (cont.)
 Here you will learn what a predictive model is and how, by actively
guiding marketing campaigns, it constitutes a key form of business
intelligence. We'll take a look inside to see how a model works:

1. Predictors Rank Your Customers to Guide Your Marketing
2. Combined Predictors Mean Smarter Rankings
3. The Computer Makes Your Model from Your Customer Data
4. A Simple Curve Shows How Well Your Model Works
5. Conclusions
Why Predictive Modelling
Nearly every business in a competitive market will eventually need to do
predictive modeling to remain ahead of the curve. Predictive modeling (also
known as predictive analytics) is the process of automatically detecting patterns
in data, then using those patterns to foretell some event. Predictive models
are commonly built to predict:
 Customer Relationship Management
 the chance a prospect will respond to an ad
 Mail recipients likely to buy
 when a customer is likely to churn
 if a person is likely to get sick
 Portfolio or Product Prediction
 Risk Management & Pricing
Some Predictive Models

The following techniques are widely used:


• Linear regression
• Logistic regression
• Regression with regularization
• Neural networks
• Support vector machines
• Naive Bayes models
• K-nearest-neighbors classification
• Decision trees
• Ensembles of trees
• Gradient boosting
Predictive Data Analytics Models

 Regression    Classification

Regression

Input Data (e.g. past house prices) → Machine Learning → Regression model → Prediction: a value
Linear Regression

 Given an input x we would like to compute an output y.
 In linear regression we assume that y and x are related by the following equation:

   y = w·x + ε

 where w is a parameter and ε represents measurement or other noise.

Figure: scatter of (x, y) points with the fitted regression line
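A minimal least-squares fit of the model y = w·x + ε on synthetic data, sketched with NumPy (the true slope of 2.0 and the noise level are made up for the demonstration):

import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50)
noise = rng.normal(scale=1.0, size=x.size)    # the ε term
y = 2.0 * x + noise                           # data generated with w = 2.0

# Least-squares estimate of w (no intercept, matching y = w*x + ε)
w_hat = (x @ y) / (x @ x)
print(f"estimated w: {w_hat:.3f}")            # close to 2.0

print(w_hat * np.array([11.0, 12.0]))         # predictions for two new inputs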
Regression Examples

 Weather prediction: http://www.bom.gov.au/watl/about/about-latest-weather-graphs.shtml
 Exchange rate prediction: https://walletinvestor.com/forex-forecast/eur-usd-prediction
Classification

Input Data (e.g. e-mail text) → Machine Learning → Classifier → Prediction: a category
Classification
 Data: A set of data records (also called examples, instances or cases) described by
   - k attributes: A1, A2, …, Ak
   - a class: each example is labelled with a pre-defined class.
 Goal: To learn a classification model from the data that can be used to predict the classes of new (future, or test) cases/instances.
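A toy nearest-neighbour classifier, sketched in NumPy, to make the "learn a model from labelled records, then predict new cases" idea concrete (the two attributes and the labels are invented):

import numpy as np

# Training records: k = 2 attributes each, labelled with a pre-defined class
X_train = np.array([[1.0, 1.2], [0.8, 0.9], [4.0, 4.2], [4.5, 3.9]])
y_train = np.array(["spam", "spam", "ham", "ham"])

def predict(x_new):
    # 1-nearest-neighbour prediction using Euclidean distance
    d = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    return y_train[d.argmin()]

print(predict(np.array([0.9, 1.0])))   # -> spam
print(predict(np.array([4.2, 4.0])))   # -> ham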
Classification Examples

https://towardsdatascience.com/applied-text-classification-on-email-spam-filtering-part-1-1861e1a83246
Regression vs Classification

Input Data → ML Regression → £450 (a numeric value)

Input Data → ML Classification → {cheap, affordable, expensive} (a category)
Applications of Predictive Modelling
 Analytical customer relationship management (CRM)
 Health Care
 Collection Analytics
 Cross-sell
 Fraud detection
 Risk management

❖ Industry Applications
Predictive modelling is used in insurance, banking, marketing,
financial services, telecommunications, retail, travel, healthcare, oil
& gas and other industries.
Predictive Models in Retail industry
•  Campaign Response Model – this model predicts the
likelihood that a customer responds to a specific campaign by
purchasing a product solicited in the campaign. The model also
predicts the amount of the purchase given a response.
➢ Regression models
➢ Customer Segmentation
➢ Cross-Sell and Upsell
➢ New Product Recommendation
➢ Customer Retention/Loyalty/Churn
➢ Inventory Management
• Will this customer move their business to a
different company?

• Does a patient have a specific disease?

• Based on past choices, which movies will interest


this viewer?

• Should I sell this stock?

• Which people should we match in our online dating


service?

• Will this patient respond to this therapy?


Predictive Models in Telecom industry
Campaign analytics
Churn modeling
Cross-selling and up-selling
Customer lifetime value analytics
Customer segmentation
Fraud analytics
Marketing spend optimization
Network optimization
Price optimization
Sales territory optimization
Predictive Analytics Software

SAS Analytics
STATISTICA
IBM Predictive Analytics
MATLAB
Minitab
