6-Significance of Exploratory Data Analysis, Making Sense of Data-06!02!2024

Uploaded by

Rahul tater

IoT Domain Analyst

(ECE3502)
Module-2

Dr. Biswajit Dwivedy

School of Electronics Engineering

VIT University, Vellore, India
Module 2
• Tags to organize data
• Tag data to pre-process large datasets
• Predictive models for forecasting
• Application of predictive models
Data pre-processing – “an important milestone
of the Data Mining Process”
✓ Data mining is a process of discovering patterns in large data
sets involving methods at the intersection of machine learning, statistics,
and database systems.
✓ Data mining is an interdisciplinary subfield of computer
science and statistics with an overall goal to extract information (with
intelligent methods) from a data set and transform the information into a
comprehensible structure for further use.
Data analysis pipeline
• Mining is not the only step in the analysis process:

Data Preprocessing → Data Mining → Result Post-processing

• Preprocessing: real data is noisy, incomplete and inconsistent. Data cleaning is required to make sense of the data.
  ◼ Techniques: Sampling, Dimensionality Reduction, Feature Selection.
• Post-processing: make the data actionable and useful to the user, through statistical analysis of importance and visualization.
Major Tasks in Data Preparation
• Data discretization
• Part of data reduction but with particular importance, especially for numerical data
• Data cleaning
• Fill in missing values, smooth noisy data, identify or remove outliers, and resolve
inconsistencies
• Data integration
• Integration of multiple databases, data cubes, or files
• Data transformation
• Normalization and aggregation
• Data reduction
• Obtains reduced representation in volume but produces the same or similar analytical
results
Data Preprocessing
 Attribute Values
 Attribute Transformation
 Normalization (Standardization)
 Aggregation
 Discretization

 Sampling
 Dimensionality Reduction
 Feature subset selection
 Distance/Similarity Calculation
 Visualization
Data Preparation as a step in the Knowledge Discovery Process

DB → Cleaning and Integration → DW → Selection and Transformation → Data Mining → Evaluation and Presentation → Knowledge
Why Prepare Data?
• Data need to be formatted for a given software tool
• Data need to be made adequate for a given method
• Data in the real world is dirty
• incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
• e.g., occupation=“”
• noisy: containing errors or outliers
• e.g., Salary=“-10”, Age=“222”
• inconsistent: containing discrepancies in codes or names
• e.g., Age=“42” Birthday=“03/07/1997”
• e.g., Was rating “1,2,3”, now rating “A, B, C”
• e.g., discrepancy between duplicate records
• e.g., Address: travessa da Igreja de Nevogilde; Parish: Paranhos
Attribute Values

Data is described using attribute values.

• Attribute values are numbers or symbols assigned to an attribute
• Distinction between attributes and attribute values
  ◼ The same attribute can be mapped to different attribute values
    Example: height can be measured in feet or meters
  ◼ Different attributes can be mapped to the same set of values
    Example: attribute values for ID and age are both integers
    But the properties of the attribute values can differ: ID has no limit, while age has a maximum and minimum value
Types of Attributes
 There are different types of attributes
 Nominal
◼ Examples: ID numbers, eye color, zip codes
 Ordinal
◼ Examples: rankings (e.g., taste of potato chips on a scale from
1-10), grades, height in {tall, medium, short}
 Interval
◼ Examples: calendar dates
 Ratio
◼ Examples: length, time, counts
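Measurement scales, in order of increasing information content:
• Nominal (categorical) scale
• Ordinal scale
• Interval scale
• Ratio scale
Nominal and ordinal scales are qualitative; interval and ratio scales are quantitative.
Attributes can also be discrete or continuous.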
Discrete and Continuous Attributes
• Discrete Attribute
  ◼ Has only a finite or countably infinite set of values
  ◼ Examples: zip codes, counts, or the set of words in a collection of documents
  ◼ Often represented as integer variables.
• Continuous Attribute
  ◼ Has real numbers as attribute values
  ◼ Examples: temperature, height, or weight.
  ◼ Practically, real values can only be measured and represented using a finite number of digits.
Data Quality

Data has attribute values. Then: how good is our data with respect to these attribute values?

Data Quality
• Examples of data quality problems:
  ◼ Noise and outliers
  ◼ Missing values
  ◼ Duplicate data

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        10000K          Yes    ← A mistake or a millionaire?
6    No      NULL            60K             No     ← Missing values
7    Yes     Divorced        220K            NULL   ← Missing values
8    No      Single          85K             Yes
9    No      Married         90K             No     ← Inconsistent duplicate entries
9    No      Single          90K             No     ← Inconsistent duplicate entries
Data Quality: Noise
 Noise refers to modification of original values
 Examples: distortion of a person’s voice when talking on
a poor phone and “snow” on television screen

[Figure: Two Sine Waves (left panel); Two Sine Waves + Noise (right panel)]

Data Quality: Outliers
 Outliers are data objects with characteristics that are
considerably different than most of the other data objects in
the data set
Outliers
• Outliers are values thought to be out of range.
• “An outlier is an observation that deviates so much from other
observations as to arouse suspicion that it was generated by a
different mechanism”
• Can be detected by standardizing observations and labeling the standardized values that fall outside a predetermined bound as outliers
• Outlier detection can be used for fraud detection or data cleaning

• Approaches:
• do nothing
• enforce upper and lower bounds
• let binning handle the problem
Outlier detection
• Univariate
• Compute the mean $\bar{x}$ and the standard deviation $s$. For k = 2 or 3, x is an outlier if it lies outside the limits (normal distribution assumed)

$$(\bar{x} - k s,\ \bar{x} + k s)$$
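
The univariate rule can be sketched in a few lines of Python. This is a minimal illustration rather than part of the original slides: the function name `zscore_outliers` is hypothetical, the sample standard deviation is assumed for s, and the figures reuse the Taxable Income column (in thousands) from the Data Quality table.

```python
from statistics import mean, stdev

def zscore_outliers(values, k=3):
    """Return the values lying outside (mean - k*s, mean + k*s)."""
    m = mean(values)
    s = stdev(values)  # assumption: sample standard deviation
    lower, upper = m - k * s, m + k * s
    return [x for x in values if x < lower or x > upper]

# Taxable incomes in thousands, including the suspicious 10000K entry
incomes = [125, 100, 70, 120, 10000, 60, 220, 85, 90, 90]
print(zscore_outliers(incomes, k=2))  # the 10000K record is flagged
```

With k = 3 the same sample flags nothing, because the single extreme value inflates s. With only ten observations, no point can lie more than about 2.85 sample standard deviations from the mean, so the choice of k matters on small data sets.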
Outlier detection
• Multivariate
  ◼ Clustering: very small clusters are outliers

Recommended reading

Only with hard work and a favorable context will you have the chance to become an outlier!!!
Data Quality: Missing
Values
 Reasons for missing values
 Information is not collected
(e.g., people decline to give their age and weight)
 Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)

 Handling missing values

 Eliminate Data Objects
 Estimate Missing Values
 Ignore the Missing Value During Analysis
 Replace with all possible values (weighted by their
probabilities)
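
Two of the strategies listed above, eliminating data objects and estimating missing values, might look as follows in Python. This is a sketch under stated assumptions: the income figures are hypothetical, `None` stands for a missing value, and mean imputation is only one of many possible estimators.

```python
from statistics import mean

# Hypothetical taxable incomes in thousands; None marks a missing value.
incomes = [125, 100, None, 120, 60, None, 85, 110]

def drop_missing(values):
    """Eliminate data objects whose value is missing."""
    return [v for v in values if v is not None]

def impute_mean(values):
    """Estimate each missing value by the mean of the observed values."""
    fill = mean(drop_missing(values))
    return [fill if v is None else v for v in values]

print(drop_missing(incomes))
print(impute_mean(incomes))
```

Dropping records shrinks the data set, while imputing keeps every record but pulls the filled-in values towards the centre of the distribution.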
Data Quality: Duplicate Data

• Data set may include data objects that are duplicates, or almost duplicates, of one another
  ◼ Major issue when merging data from heterogeneous sources

 Examples:
 Same person with multiple email addresses

 Data cleaning
 Process of dealing with duplicate data issues
Data Quality: Handle
Noise(Binning)
 Binning
 sort data and partition into (equi-depth) bins
 smooth by bin means, bin median, bin boundaries, etc.

 Regression
 smooth by fitting a regression function
 Clustering
 detect and remove outliers
 Combined computer and human inspection
 detect suspicious values automatically and check by human
Data Quality: Handle Noise (Binning)
• Equal-width binning
  ◼ Divides the range into N intervals of equal size
  ◼ Width of intervals: W = (B − A)/N, where A and B are the lowest and highest values of the attribute
  ◼ Simple
  ◼ Outliers may dominate result
• Equal-depth binning
  ◼ Divides the range into N intervals, each containing approximately the same number of records
  ◼ Skewed data is also handled well
Simple Methods: Binning
Example: customer ages
[Figure: histogram of the number of values per age interval]
Equi-width binning: 0-10, 10-20, 20-30, 30-40, 40-50, 50-60, 60-70, 70-80
Equi-depth binning: 0-22, 22-31, 32-38, 38-44, 44-48, 48-55, 55-62, 62-80
Simple Discretization Methods:
Binning
 Equal-width (distance) partitioning:
 Divides the range into N intervals of equal size: uniform grid
 if A and B are the lowest and highest values of the attribute, the width
of intervals will be: W = (B –A)/N.
 The most straightforward, but outliers may dominate presentation
 Skewed data is not handled well.

 Equal-depth (frequency) partitioning:

 Divides the range into N intervals, each containing approximately same
number of samples
 Good data scaling
 Managing categorical attributes can be tricky.
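
The contrast between the two partitioning schemes can be made concrete with a short Python sketch. The function names and the age values are hypothetical. The last value is a deliberate outlier to show how it dominates the equal-width result.

```python
def equal_width_bins(values, n):
    """Assign each value to one of n intervals of width W = (B - A) / n."""
    a, b = min(values), max(values)
    width = (b - a) / n
    bins = [[] for _ in range(n)]
    for v in sorted(values):
        # Clamp so that the maximum value B falls into the last bin;
        # if all values are equal (width 0), everything goes into bin 0.
        idx = min(int((v - a) / width), n - 1) if width else 0
        bins[idx].append(v)
    return bins

def equal_depth_bins(values, n):
    """Split the sorted values into n bins holding (nearly) equal counts."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n)
    bins, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins

# Hypothetical customer ages with one outlier (95)
ages = [22, 25, 27, 31, 34, 38, 41, 45, 52, 95]
print(equal_width_bins(ages, 3))
print(equal_depth_bins(ages, 3))
```

On this sample the equal-width scheme puts eight of the ten values into the first interval, whereas the equal-depth scheme yields bins of four, three and three values.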
Data Quality: Handle
Noise(Binning)
Example: Sorted price values 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into three (equi-depth) bins
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means (rounded to the nearest integer)
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
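
The worked example above can be reproduced with the following Python sketch. The function names are hypothetical. Two details are assumptions: bin means are rounded to the nearest integer (which matches the figures shown), and a value equally close to both boundaries goes to the lower one.

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted

# Partition into three equi-depth bins of four values each
bins = [prices[i:i + 4] for i in range(0, len(prices), 4)]

def smooth_by_means(bins):
    """Replace every value by the (rounded) mean of its bin."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's two boundaries."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so these are the boundaries
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

print(bins)
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```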
Data Quality: Handle Noise (Regression)
• Replace noisy or missing values by predicted values
• Requires a model of attribute dependencies (may be wrong!)
• Can be used for data smoothing or for handling missing data
[Figure: regression line y = x + 1 in the x-y plane; the observed value Y1 at X1 is replaced by the fitted value Y1']
Data Quality
There are many more noise handling techniques
….
> Imputation
Data Transformation

Data has attribute values. Then: can we compare these attribute values?

For example, compare the following two pairs of records:

(1) (5.9 ft, 50 Kg)
(2) (4.6 ft, 55 Kg)
vs.
(3) (5.9 ft, 50 Kg)
(4) (5.6 ft, 56 Kg)

We need data transformation to make records with different dimensions (attributes) comparable …
Data Transformation Techniques
• Normalization: values are scaled to fall within a small, specified range.
  ◼ min-max normalization
  ◼ z-score normalization (z-scores are a way to compare results to a “normal” population)
  ◼ normalization by decimal scaling
• Centralization:
  ◼ Based on fitting a distribution to the data
  ◼ Distance function between distributions
    KL (Kullback-Leibler) distance
    Mean centering
Data Transformation: Normalization
• min-max normalization

$$v' = \frac{v - \min}{\max - \min}\,(\mathit{new\_max} - \mathit{new\_min}) + \mathit{new\_min}$$

• z-score normalization

$$v' = \frac{v - \mathit{mean}}{\mathit{stand\_dev}}$$

• normalization by decimal scaling

$$v' = \frac{v}{10^{j}}$$

where j is the smallest integer such that max(|v'|) < 1
Example: Data Normalization
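
The slide's own example did not survive extraction, so the following Python sketch is a stand-in built on hypothetical values. It applies the three normalization formulas given earlier. The function names are illustrative, and the population standard deviation is assumed for stand_dev, since the slides do not say which variant is meant.

```python
from statistics import mean, pstdev

def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale linearly so that min maps to new_min and max to new_max."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Centre on the mean and divide by the standard deviation."""
    m, s = mean(values), pstdev(values)  # assumption: population std. dev.
    return [(v - m) / s for v in values]

def decimal_scaling(values):
    """Divide by 10**j for the smallest integer j giving max(|v'|) < 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

incomes = [200, 300, 400, 600, 1000]  # hypothetical values
print(min_max(incomes))
print(z_score(incomes))
print(decimal_scaling([-986, 917]))  # j = 3 for these values
```

Min-max normalization requires max > min, and z-score normalization requires a non-zero standard deviation. A constant attribute would need separate handling.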
Data Transformation: Aggregation
 Combining two or more attributes (or objects) into a
single attribute (or object)

 Purpose
 Data reduction
◼ Reduce the number of attributes or objects
 Change of scale
◼ Cities aggregated into regions, states, countries, etc
 More “stable” data
◼ Aggregated data tends to have less variability
Data Transformation: Discretization
• Motivation for discretization
  ◼ Some data mining algorithms only accept categorical attributes
  ◼ May improve understandability of patterns

Data Transformation:
Discretization
 Task
 Reduce the number of values for a given continuous attribute
by partitioning the range of the attribute into intervals
 Interval labels replace actual attribute values

 Methods
• Binning (as explained earlier)

• Cluster analysis

• Entropy-based Discretization (Supervised)

Data Sampling

Data may be big. Then: can we make it small by selecting some part of it?

Data sampling can do this…

“Sampling is the main technique employed for data selection.”

Big Data → Data Sampling → Sampled Data
Data Sampling
• Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming.
  ◼ Example: What is the average height of a person in Ioannina?
  ◼ We cannot measure the height of everybody
• Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming.
  ◼ Example: We have 1M documents. What fraction of document pairs has at least 100 words in common?
  ◼ Computing the number of common words for all pairs requires on the order of 10^12 comparisons
Data Sampling …
• The key principle for effective sampling is the following:
  ◼ Using a sample will work almost as well as using the entire data set, if the sample is representative
  ◼ A sample is representative if it has approximately the same property (of interest) as the original set of data
  ◼ Otherwise we say that the sample introduces some bias
• What happens if we take a sample from the university campus to compute the average height of a person in Vellore?
Types of Sampling
 Simple Random Sampling
 There is an equal probability of selecting any particular item

 Sampling without replacement

 As each item is selected, it is removed from the population

 Sampling with replacement

 Objects are not removed from the population as they are selected for the sample.
◼ In sampling with replacement, the same object can be picked up more than once

 Stratified sampling
 Split the data into several partitions; then draw random samples from each
partition
Types of Sampling (cont.)
• Sampling with replacement makes analytical computation of probabilities easier, because every draw has the same distribution.
  ◼ E.g., we have 100 people: 51 are women, so P(W) = 0.51, and 49 are men, so P(M) = 0.49. If I pick two persons, what is the probability P(W,W) that both are women?
  ◼ Sampling with replacement: P(W,W) = 0.51^2 = 0.2601
  ◼ Sampling without replacement: P(W,W) = 51/100 * 50/99 ≈ 0.2576
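
The two probabilities can be checked exactly with Python's `fractions` module. This is a small sketch, and the function name `prob_both_women` is hypothetical.

```python
from fractions import Fraction

def prob_both_women(women, total, replacement):
    """Probability that two successive picks are both women."""
    first = Fraction(women, total)
    if replacement:
        # The first person is put back, so the second draw is unchanged.
        second = Fraction(women, total)
    else:
        # One woman has left the population before the second draw.
        second = Fraction(women - 1, total - 1)
    return first * second

with_repl = prob_both_women(51, 100, replacement=True)
without_repl = prob_both_women(51, 100, replacement=False)
print(with_repl, float(with_repl))
print(without_repl, float(without_repl))
```

To draw actual samples rather than compute probabilities, the standard library offers `random.choices` (with replacement) and `random.sample` (without replacement).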
Sample Size

[Figure: three panels showing 8000 points, 2000 points and 500 points]
Example: Data Collection

Data Collection → Data Preprocessing → Data Mining → Result Post-processing

• Today there is an abundance of data online
  ◼ Facebook, Twitter, Wikipedia, Web, etc…
• We can extract interesting information from this data, but first we need to collect it
  ◼ Customized crawlers, use of public APIs
  ◼ Additional cleaning/processing to parse out the useful parts
  ◼ Respect of crawling etiquette
Dimensionality Reduction
Each record has many attributes
◼ useful, useless or correlated

Then,

Can we select some small subset of attributes?

Dimensionality Reduction can do this….

Dimensionality Reduction
• Why?
  ◼ When dimensionality increases, data becomes increasingly sparse in the space that it occupies
  ◼ Curse of dimensionality: definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful
• Objectives:
  ◼ Avoid the curse of dimensionality
  ◼ Reduce the amount of time and memory required by data mining algorithms
• Observation: certain dimensions are correlated
Dimensionality Reduction (cont.)
• Allow data to be more easily visualized
• May help to eliminate irrelevant features or reduce noise
• Techniques
  ◼ Principal Component Analysis or Singular Value Decomposition
  ◼ Mapping data to a new space: Wavelet Transform
  ◼ Others: supervised and non-linear techniques
Distance/Similarity
Data has many records

Then,

Can we find similar records?

Distance and Similarity are commonly used….

What is similar?
[Figure: objects compared by shape, colour, size and pattern]
Similarity and Dissimilarity
• Similarity
  ◼ Numerical measure of how alike two data objects are
  ◼ Is higher when objects are more alike
  ◼ Often falls in the range [0,1]
• Dissimilarity
  ◼ Numerical measure of how different two data objects are
  ◼ Lower when objects are more alike
  ◼ Minimum dissimilarity is often 0
  ◼ Upper limit varies
• Proximity refers to a similarity or dissimilarity

Euclidean Distance
• Euclidean Distance

$$\mathrm{dist}(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}$$

where n is the number of dimensions (attributes) and $p_k$ and $q_k$ are, respectively, the kth attributes (components) of data objects p and q.

• Standardization is necessary if scales differ.

Euclidean Distance (Metric)
Euclidean distance:

Point 1 is: $(x_1, x_2, \dots, x_n)$
Point 2 is: $(y_1, y_2, \dots, y_n)$

Euclidean distance is:

$$\sqrt{(y_1 - x_1)^2 + (y_2 - x_2)^2 + \dots + (y_n - x_n)^2}$$

David Corne, and Nick Taylor, Heriot-Watt University. These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
Euclidean Distance

[Figure: scatter plot of points p1 to p4; x axis from 0 to 6, y axis from 0 to 3]

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

      p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0

Distance Matrix
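
The distance matrix can be recomputed from the four points with a few lines of Python. This is an illustrative sketch: the helper name `euclidean` and the dictionary layout are assumptions, and the values are rounded to three decimals to match the table.

```python
import math

points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

def euclidean(p, q):
    """Square root of the summed squared differences over all attributes."""
    return math.sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))

names = list(points)
matrix = [
    [round(euclidean(points[a], points[b]), 3) for b in names]
    for a in names
]

for name, row in zip(names, matrix):
    print(name, row)
```

The matrix is symmetric with zeros on the diagonal, so in practice only the upper triangle needs to be computed.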
Sample Correlation Matrix
[Figure: correlation matrix on a colour scale from -1 through 0 to +1, for data on characteristics of Boston suburbs. Labelled variables: business acreage, nitrous oxide, average # rooms, median house value, percentage of large residential lots]
Summary

• Every real-world data set needs some kind of data pre-processing
  ◼ Deal with missing values
  ◼ Correct erroneous values
  ◼ Select relevant attributes
  ◼ Adapt the data set format to the software tool to be used
• In general, data pre-processing consumes more than 60% of the effort of a data mining project
Information/Entropy
• Given probabilities $p_1, p_2, \dots, p_s$ whose sum is 1, entropy is defined as:

$$H(p_1, p_2, \dots, p_s) = -\sum_{i=1}^{s} p_i \log p_i$$

• Entropy measures the amount of randomness or surprise or uncertainty.
• Only takes into account non-zero probabilities
Entropy-Based Discretization
• Given a set of samples S, if S is partitioned into two intervals $S_1$ and $S_2$ using boundary T, the entropy after partitioning is

$$E(S, T) = \frac{|S_1|}{|S|}\,\mathrm{Ent}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Ent}(S_2)$$

• The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization.
• The process is recursively applied to the partitions obtained until some stopping criterion is met, e.g., when the information gain falls below a threshold δ:

$$\mathrm{Ent}(S) - E(S, T) < \delta$$

• Experiments show that it may reduce data size and improve classification accuracy
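
A single binary split of this procedure can be sketched in Python as follows. Several choices are assumptions rather than something the slides specify: logarithms are taken to base 2, candidate boundaries are midpoints between consecutive distinct values, the function names are hypothetical, and the labelled samples are invented for illustration. The recursion and the stopping test are left out.

```python
import math
from collections import Counter

def entropy(labels):
    """Ent(S): entropy of the class labels in S, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_entropy(samples, boundary):
    """E(S, T): size-weighted entropy of the two intervals created by T."""
    left = [label for value, label in samples if value <= boundary]
    right = [label for value, label in samples if value > boundary]
    n = len(samples)
    return len(left) / n * entropy(left) + len(right) / n * entropy(right)

def best_boundary(samples):
    """Return the candidate boundary T that minimizes E(S, T).

    Assumes the attribute takes at least two distinct values.
    """
    values = sorted({value for value, _ in samples})
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return min(candidates, key=lambda t: split_entropy(samples, t))

# Hypothetical (attribute value, class label) pairs
samples = [(1, "no"), (2, "no"), (3, "no"), (7, "yes"), (8, "yes"), (9, "yes")]
t = best_boundary(samples)
gain = entropy([label for _, label in samples]) - split_entropy(samples, t)
print(t, gain)
```

Here the boundary between 3 and 7 separates the classes perfectly, so the entropy after partitioning is zero and the information gain equals the original entropy of one bit.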
Predictive Data Analytics: Models and Applications

Data Analytics
• Data analytics is the science of analysing raw data in order to draw conclusions about that information.

Predictive Modelling
Geisser (1993) defines predictive modelling as “the process by which a model is created or chosen to try to best predict the probability of an outcome.”
What is Predictive Modelling
• Predictive analytics is the branch of advanced analytics that is used to make predictions about unknown future events.
• Predictive analytics uses many techniques from data mining, statistics, modeling, machine learning, and artificial intelligence to analyze current data and make predictions about the future.
• Predictive modeling is a process used in predictive analytics to create a statistical model of future behavior.
• Predictive analytics is the area of data mining concerned with forecasting probabilities and trends.
Predictive Analytics Process
Business process and features of Predictive Modelling

• Business process of predictive modelling
  ❖ Creating the model
  ❖ Testing the model
  ❖ Validating the model
  ❖ Evaluating the model
• Features of predictive modelling
  ❖ Data analysis and manipulation
  ❖ Visualization
  ❖ Statistics
  ❖ Hypothesis testing
How the model works
✓ In predictive modeling, data is collected for the relevant predictors, a statistical model is formulated, predictions are made and the model is validated (or revised) as additional data becomes available.
✓ The model may employ a simple linear equation or a complex neural network, mapped out by sophisticated software.
How the model works (cont.)
• Here you will learn what a predictive model is, and how, by actively guiding marketing campaigns, it constitutes a key form of business intelligence. We'll take a look inside to see how a model works:

1. Predictors Rank Your Customers to Guide Your Marketing
2. Combined Predictors Mean Smarter Rankings
3. The Computer Makes Your Model from Your Customer Data
4. A Simple Curve Shows How Well Your Model Works
5. Conclusions
Why Predictive Modelling
Nearly every business in competitive markets will eventually need to do predictive modeling to remain ahead of the curve. Predictive modeling (also known as predictive analytics) is the process of automatically detecting patterns in data, then using those patterns to foretell some event. Predictive models are commonly built to predict:
• Customer Relationship Management
• the chance a prospect will respond to an ad
• Mail recipients likely to buy
• when a customer is likely to churn
• if a person is likely to get sick
• Portfolio or Product Prediction
• Risk Management & Pricing
Some Predictive Models

Ideally, these techniques are widely used:

• Linear regression
• Logistic regression
• Regression with regularization
• Neural networks
• Support vector machines
• Naive Bayes models
• K-nearest-neighbors classification
• Decision trees
• Ensembles of trees
• Gradient boosting
Predictive Data Analytics Models
• Regression
• Classification

Regression

Input Data (e.g. past house prices) → Machine Learning (ML Regression) → Prediction: Value
Linear Regression

Given an input x we would like to compute an output y.
In linear regression we assume that y and x are related by the following equation:

$$y = wx + \varepsilon$$

where w is a parameter and ε represents measurement or other noise.
[Figure: plot of Y against X]
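
One common way to estimate w from data is ordinary least squares, which for this intercept-free model has the closed form w = Σ x·y / Σ x². The slides do not say how w is obtained, so the choice of least squares is an assumption, as are the function name `fit_slope` and the sample data.

```python
def fit_slope(xs, ys):
    """Least-squares estimate of w in y = w*x + noise (no intercept term)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# Hypothetical noisy observations of roughly y = 2x
xs = [1, 2, 3, 4]
ys = [2.1, 3.9, 6.2, 7.8]

w = fit_slope(xs, ys)
prediction = w * 5  # predicted output for a new input x = 5
print(w, prediction)
```

The estimate is undefined when every x is zero, since the denominator vanishes. A model with an intercept would need a second parameter.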
Regression Examples
• Weather prediction (http://www.bom.gov.au/watl/about/about-latest-weather-graphs.shtml)
• Exchange rate prediction (https://walletinvestor.com/forex-forecast/eur-usd-prediction)
Classification

Input Data (e.g. e-mail text) → Machine Learning (ML Classifier) → Prediction: Category
Classification

Data: A set of data records (also called examples, instances or cases) described by
• k attributes: A1, A2, … Ak.
• a class: each example is labelled with a pre-defined class.
Goal: To learn a classification model from the data that can be used to predict the classes of new (future, or test) cases/instances.
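
As one concrete instance of this setup, the sketch below uses k-nearest-neighbors classification: a new case receives the majority class among its k closest training records. The training records (floor area in square metres, number of rooms) and their labels are hypothetical, and the function name `knn_predict` is illustrative. `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Predict the class of `query` by majority vote among its k nearest records.

    train: list of (attributes, class_label) pairs
    query: tuple of attributes for the new case
    """
    nearest = sorted(train, key=lambda rec: math.dist(rec[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical labelled examples: (floor area, rooms) -> price class
train = [
    ((50, 1), "cheap"),
    ((60, 2), "cheap"),
    ((65, 2), "cheap"),
    ((120, 4), "expensive"),
    ((140, 5), "expensive"),
    ((150, 5), "expensive"),
]

print(knn_predict(train, (58, 2)))
print(knn_predict(train, (135, 4)))
```

Because the vote relies on Euclidean distance, attributes measured on very different scales would normally be normalized first.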
Classification Examples

https://towardsdatascience.com/applied-text-classification-on-email-spam-filtering-part-1-1861e1a83246
Regression vs Classification

Input Data → ML Regression → £450
Input Data → ML Classification → {cheap, affordable, expensive}
Applications of Predictive Modelling
• Analytical customer relationship management (CRM)
• Health Care
• Collection Analytics
• Cross-sell
• Fraud detection
• Risk management

❖ Industry Applications
Predictive modelling is used in insurance, banking, marketing, financial services, telecommunications, retail, travel, healthcare, oil & gas and other industries.
Predictive Models in Retail industry
• Campaign Response Model: this model predicts the likelihood that a customer responds to a specific campaign by purchasing a product solicited in the campaign. The model also predicts the amount of the purchase, given a response.
➢ Regression models
➢ Customer Segmentation
➢ Cross-Sell and Upsell
➢ New Product Recommendation
➢ Customer Retention/Loyalty/Churn
➢ Inventory Management
• Will this customer move their business to a different company?
• Does a patient have a specific disease?
• Based on past choices, which movies will interest this viewer?
• Should I sell this stock?
• Which people should we match in our online dating service?
• Will this patient respond to this therapy?

Predictive Models in Telecom industry
Campaign analytics
Churn modeling
Cross-selling and up-selling
Customer lifetime value analytics
Customer segmentation
Fraud analytics
Marketing spend optimization
Network optimization
Price optimization
Sales territory optimization
Predictive Analytics Software

SAS Analytics
STATISTICA
IBM Predictive Analytics
MATLAB
Minitab
