
Data Mining

LAKSHMI VIVEKA KESANAPALLI


Associate Professor
Dept of CSE (Artificial Intelligence)
Pragati Engineering College

UNIT – II Part - II
Syllabus

 Data Preprocessing: Aggregation, Sampling, Dimensionality Reduction, Feature Subset Selection, Feature Creation, Discretization and Binarization, Variable Transformation, Measures of Similarity and Dissimilarity. (Tan & Vipin)
Data Preprocessing
 Data preprocessing refers to the techniques that are applied to make the data more suitable for data mining.
 It is a broad area that consists of a number of different strategies and techniques.
 They are:
 Aggregation
 Sampling
 Dimensionality reduction
 Feature subset selection
 Feature Creation
 Discretization and binarization
 Variable transformation
The goal is to improve the data mining analysis with respect to
time, cost and quality.

Aggregation
 The combining of two or more objects into a single object is called aggregation.
◦ Eg: Aggregating all the transactions of a single store into one store-wide transaction.
 Quantitative attributes, such as price, are typically aggregated by taking a sum or an average.
 A qualitative attribute, such as item, can either be omitted or summarized as the set of all the items sold at that location.
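A minimal sketch of this idea (assuming Python with pandas; the column names and values are illustrative, not from the slides):

import pandas as pd

transactions = pd.DataFrame({
    "store": ["S1", "S1", "S2", "S2", "S2"],
    "item":  ["pen", "book", "pen", "bag", "book"],
    "price": [10.0, 120.0, 12.0, 500.0, 110.0],
})
# Quantitative attribute (price) aggregated with a sum; qualitative
# attribute (item) summarized as the set of items sold at that store.
store_wide = transactions.groupby("store").agg(
    total_sales=("price", "sum"),
    items=("item", lambda s: set(s)),
)
print(store_wide)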
 Motivations for aggregation:
 Smaller data sets resulting from aggregation require less memory and processing time.
 Aggregation can act as a change of scope or scale by providing a high-level view of the data instead of a low-level view.
 The behaviour of groups of objects is often more stable than that of individual objects (less variability).
 Disadvantage: Potential loss of interesting details.
 Eg: Aggregating sales over months loses information about which day of the week has the highest sales.
Sampling
 Sampling is a commonly used approach
for selecting a subset of the data objects to
be analyzed.
 The motivation for sampling is that it is
too expensive or time consuming to
process all the data.
 The key principle of effective sampling is that the sample should be representative.
 A sample is representative if it has approximately the same properties as the original set of data.

Sampling Approaches
 There are many sampling techniques, but the most popular ones are:
1. Simple random sampling
2. Stratified sampling
Simple Random Sampling:
- It is the simplest type of sampling, in which the sample is selected at random.
- In this technique, there is an equal probability of selecting any particular item.
- There are two variations on random sampling
i. Sampling without replacement
ii. Sampling with replacement

 Sampling without replacement:
As each item is selected, it is removed from the set of all objects (the population).
 Sampling with replacement:
Objects are not removed from the population as they are selected for the sample.
In sampling with replacement, the same object can be picked more than once.
Sampling with replacement is simpler to analyze, since the probability of selecting any object remains constant during the sampling process.
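A minimal sketch of the two variations (assuming Python with NumPy, not part of the original slides):

import numpy as np

rng = np.random.default_rng(seed=42)
population = np.arange(100)                 # 100 hypothetical data objects
# Without replacement: each selected object is removed from the population,
# so no object can appear twice in the sample.
sample_without = rng.choice(population, size=10, replace=False)
# With replacement: objects stay in the population, so the same object
# may be picked more than once.
sample_with = rng.choice(population, size=10, replace=True)
print(sample_without)
print(sample_with)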

Stratified Sampling
 When the population consists of different types of objects with widely different numbers of objects, simple random sampling may fail to adequately represent the less frequent types.
 In such cases, we use "Stratified Sampling".
 In this approach, an equal number of objects is drawn from each group, even though the groups are of different sizes.
 In another variation, the number of objects drawn from each group is proportional to the size of that group.
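A minimal sketch of both variations (assuming Python with pandas; the group sizes are illustrative):

import pandas as pd

df = pd.DataFrame({
    "group": ["A"] * 900 + ["B"] * 90 + ["C"] * 10,
    "value": range(1000),
})
# Variation 1: draw an equal number of objects from every group.
equal = df.groupby("group").sample(n=5, random_state=1)
# Variation 2: draw a number proportional to the size of each group (10% here).
proportional = df.groupby("group").sample(frac=0.1, random_state=1)
print(equal["group"].value_counts())
print(proportional["group"].value_counts())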
Dimensionality Reduction

 A key benefit of dimensionality reduction is that many data mining algorithms work better if the dimensionality is lower.
 Dimensionality reduction can eliminate irrelevant features and reduce noise, and it helps avoid the "Curse of Dimensionality".

Curse of Dimensionality
 If the dimensionality increases, the data becomes increasingly sparse in the space that it occupies.
 In such cases, data analysis becomes significantly harder. This phenomenon is called the "Curse of Dimensionality".
 Another benefit of dimensionality reduction is that the model can be more understandable, because it may involve fewer attributes.
 Dimensionality reduction may allow the data to be more easily visualized.
 The amount of time and memory required by the data mining algorithm is also reduced.

Linear algebra techniques for Dimensionality
reduction:
 Principal Component Analysis (PCA):
It is a linear algebra technique for continuous attributes
that finds new attributes (principal components) that
i) are linear combinations of the original attributes.
ii) are orthogonal (perpendicular) to each other.
iii) capture the maximum amount of variation in the
data.
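A minimal sketch of PCA (assuming Python with scikit-learn; the random data is only a placeholder for a real continuous data set):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # 100 objects with 5 continuous attributes
pca = PCA(n_components=2)              # keep the 2 components with the most variance
X_reduced = pca.fit_transform(X)       # new attributes: orthogonal linear
                                       # combinations of the original attributes
print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # variance captured by each component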

 Singular Value Decomposition (SVD):
It is also a linear algebra technique that is related to PCA.
It is a method of decomposing a matrix into three other matrices:
A = U S Vᵀ
where A is an m x n matrix,
U is an m x n matrix with orthonormal columns,
S is an n x n diagonal matrix (of singular values), and
V is an n x n orthogonal matrix.
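A minimal sketch of the decomposition (assuming Python with NumPy):

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))                     # an m x n matrix (m=6, n=4)
# Economy-size SVD: U is m x n, S holds the n singular values, Vt is V transposed.
U, S, Vt = np.linalg.svd(A, full_matrices=False)
A_reconstructed = U @ np.diag(S) @ Vt           # A = U S V^T
print(np.allclose(A, A_reconstructed))          # True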

Feature Subset Selection
 It reduces the data set size by removing
redundant and irrelevant features.
 Redundant features duplicate much or all of the
information contained in one or more other
attributes.
 Eg: The purchase price of a product and the amount of sales tax (GST) paid contain much of the same information.
 Irrelevant features contain almost no useful information for the data mining task at hand.
 Eg: A student's roll number is irrelevant to the task of predicting the student's CGPA.

 Redundant and irrelevant features can reduce classification accuracy and the quality of the clusters that are found.
 The standard approaches for selecting a feature subset are:
(i) Brute force approach: Try all possible subsets of features as input to the data mining algorithm of interest and then take the subset that produces the best results.
This approach is impractical, as 'n' attributes have 2^n subsets.
(ii) Embedded approach: Feature selection occurs naturally as part of the data mining algorithm. During the operation of the data mining algorithm, the algorithm itself decides which attributes to use and which to ignore.

(iii) Filter approach: Features are selected before the data mining algorithm is run.
 Eg: Select the set of attributes whose pair-wise correlation is as low as possible (a minimal sketch follows below).
(iv) Wrapper approach: It uses the target data mining algorithm as a black box to find the best subset of attributes.
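A minimal sketch of such a filter (assuming Python with pandas; the greedy rule and the 0.9 threshold are illustrative assumptions, not from the slides):

import numpy as np
import pandas as pd

def low_correlation_filter(df, threshold=0.9):
    # Greedily keep attributes whose pairwise |correlation| with the
    # already-kept attributes stays below the threshold.
    corr = df.corr().abs()
    keep = []
    for col in df.columns:
        if all(corr.loc[col, kept] < threshold for kept in keep):
            keep.append(col)
    return keep

rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({
    "x": x,
    "x_copy": 2 * x + 0.01 * rng.normal(size=200),   # redundant attribute
    "y": rng.normal(size=200),
})
print(low_correlation_filter(df))   # ['x', 'y'] -- the redundant copy is dropped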

Flow chart of a feature subset
selection process

Feature Creation
 Feature creation is the creation of a new set of attributes that captures the important information in a data set more effectively than the original attributes.
 The number of new attributes can be smaller than the number of original attributes.
 There are three methods for creating new
attributes.
(i) Feature extraction
(ii) Mapping the data to a new space
(iii) Feature construction
(i) Feature extraction: The creation of a new
set of features from the original raw data
is known as feature extraction.
 Eg: Presence or absence of edges instead
of pixels in image processing.
 Feature extraction is highly domain
specific.
(ii) Mapping the data to a new space: A
totally different view of the data can
reveal important and interesting features.
 Eg: Fourier transforms
Wavelet transforms
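A minimal sketch of such a mapping (assuming Python with NumPy; the 7 Hz signal is an illustrative assumption): a periodic pattern that is hard to see in the raw time series becomes a single dominant feature in the frequency domain.

import numpy as np

t = np.linspace(0, 1, 500, endpoint=False)
rng = np.random.default_rng(0)
signal = np.sin(2 * np.pi * 7 * t) + 0.5 * rng.normal(size=t.size)   # 7 Hz tone + noise
spectrum = np.abs(np.fft.rfft(signal))            # magnitudes in the new (frequency) space
freqs = np.fft.rfftfreq(t.size, d=t[1] - t[0])    # the corresponding frequency axis
print(freqs[np.argmax(spectrum[1:]) + 1])         # ~7.0, the dominant frequency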

(iii) Feature Construction: Sometimes the features in the original data set have the necessary information, but not in a form suitable for the data mining algorithm.
 In this situation, one or more new features constructed out of the original features can be more useful than the original features.
 Eg: A density feature constructed from the mass and volume features, i.e.,
density = mass / volume
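A minimal sketch of this construction (assuming Python with pandas; the values are illustrative):

import pandas as pd

objects = pd.DataFrame({
    "mass":   [10.0, 4.0, 2.5],    # original feature
    "volume": [2.0, 1.0, 0.5],     # original feature
})
objects["density"] = objects["mass"] / objects["volume"]   # constructed feature
print(objects)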

Discretization and Binarization
 Transforming a continuous attribute into a categorical attribute is called discretization.
 Eg: Converting the continuous attribute age into categories such as (youth, middle-aged, senior).
 Transforming both continuous and discrete attributes into one or more binary attributes is called binarization.

 Discretization methods can be classified into two categories based on whether class information is available.
 If no class information is used for discretization, it is called "unsupervised discretization".
 Eg: Equal width approach, Equal frequency approach (a sketch follows below)
 If class information is used in discretization, it is called "supervised discretization".
 Eg: Entropy based approach
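A minimal sketch of the two unsupervised approaches (assuming Python with pandas; the ages and bin labels are illustrative):

import pandas as pd

age = pd.Series([19, 22, 25, 31, 38, 45, 52, 60, 67, 71])
# Equal width: split the range of values into 3 intervals of equal width.
equal_width = pd.cut(age, bins=3, labels=["youth", "middle-aged", "senior"])
# Equal frequency: each interval holds (roughly) the same number of objects.
equal_freq = pd.qcut(age, q=3, labels=["youth", "middle-aged", "senior"])
print(pd.DataFrame({"age": age, "equal_width": equal_width, "equal_freq": equal_freq}))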

Variable Transformation
 A variable or attribute transformation
refers to a transformation that is applied to
all the values of a variable.
 Eg: If only the magnitude of a variable is important, then the values can be transformed by taking the absolute value (|x|).
 There are two important types of variable
transformations.
(i) Simple functional transformations
(ii) Normalization or Standardization

 Simple functional transformations:
 In this type of transformation, a simple
mathematical function is applied to each
value individually.
 If x is a variable, then examples of such transformations include x^k, log x, e^x, √x, 1/x, sin x or |x|.

 Normalization or Standardization:
 The goal of normalization or
standardization is to make an entire set of
values have a particular property.
 It scales the data in such a way that all
values fall within a specified range such
as 0 to 1 or -1 to 1.
 Eg: min-max normalization
z-score normalization
Decimal scaling
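A minimal sketch of the first two of these (assuming Python with NumPy; the values are illustrative):

import numpy as np

x = np.array([2.0, 5.0, 9.0, 14.0, 20.0])
x_minmax = (x - x.min()) / (x.max() - x.min())   # min-max: values fall in [0, 1]
x_zscore = (x - x.mean()) / x.std()              # z-score: mean 0, standard deviation 1
print(x_minmax)
print(x_zscore)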

Measures of similarity and
dissimilarity (Proximity Measures)
 Similarity:
 Similarity between two objects is a
numerical measure of the degree to which
the two objects are alike.
 Similarities are usually non-negative and
are often between 0 (no similarity) and 1
(complete similarity)

 Dissimilarity:
 Dissimilarity between two objects is a
numerical measure of the degree to which
the two objects are different.
 Frequently, the term "distance" is used as
a synonym for dissimilarity.
 Dissimilarities sometimes fall in the interval [0,1], but it is also common for them to range from 0 to ∞.
 Euclidean Distance:
 The distance between two points x and y in one-, two-, three- or higher-dimensional space is given by
d(x,y) = √( Σ_{k=1..n} (x_k − y_k)² )
where n is the number of dimensions and x_k, y_k are the k-th attributes of x and y.
Eg: x = (2,3), y = (3,4)
d(x,y) = √((2−3)² + (3−4)²) = √(1 + 1) = √2 = 1.414

 Minkowski Distance:
d(x,y) = ( Σ_{k=1..n} |x_k − y_k|^r )^(1/r)
where r is a parameter and n is the number of dimensions.
- If r = 1, the distance is called the city block, Manhattan, taxicab or L1 norm distance.
- If r = 2, the distance is called the Euclidean distance or L2 norm.
- If r = ∞, the distance is called the supremum, Lmax or L∞ norm distance. This is the maximum difference between any attribute of the two objects.
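A minimal sketch of the three special cases (assuming Python with NumPy), reusing the points from the Euclidean example:

import numpy as np

def minkowski(x, y, r):
    # r = 1 -> city block, r = 2 -> Euclidean, r = np.inf -> supremum distance
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    if np.isinf(r):
        return np.abs(x - y).max()
    return (np.abs(x - y) ** r).sum() ** (1.0 / r)

x, y = (2, 3), (3, 4)
print(minkowski(x, y, 1))        # 2.0   (L1 norm)
print(minkowski(x, y, 2))        # 1.414 (L2 norm)
print(minkowski(x, y, np.inf))   # 1.0   (L-infinity norm)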

 Metric :
- If d(x,y) is the distance between two points x
and y then the following properties hold.
i) Positivity :
a) d(x,y) >= 0 for all x and y
b) d(x,y) = 0 only if x = y
ii) Symmetry :
a) d(x,y) = d(y,x)
iii) Triangle inequality :
a) d(x,z) <= d(x,y) + d(y,z) for all x, y and z
- A measure that satisfies all three properties is called a metric.

 Similarity Measures for binary data:
- Similarity measures between objects that contain only binary attributes are called similarity coefficients.
- Let x and y be two binary vectors.
- f00 = no. of attributes where x = 0 and y = 0
- f01 = no. of attributes where x = 0 and y = 1
- f10 = no. of attributes where x = 1 and y = 0
- f11 = no. of attributes where x = 1 and y = 1

Simple Matching Coefficient (SMC):
 SMC = number of matching attribute values / number of attributes
     = (f11 + f00) / (f01 + f10 + f11 + f00)
 This measure counts both presences and absences equally.
Jaccard Coefficient:
 J = number of matching presences / number of attributes not involved in 00 matches
   = f11 / (f01 + f10 + f11)

 Example:
 Given data:
p = 1 0 0 0 0 0 0 0 0 0
q = 0 0 0 0 0 0 1 0 0 1
 f01 = 2, f10 = 1, f11 = 0, f00 = 7
 Simple matching coefficient = (0 + 7) / (0 + 1 + 2 + 7) = 0.7
 Jaccard coefficient = 0 / (0 + 1 + 2) = 0
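A minimal sketch that reproduces these counts and coefficients (assuming Python with NumPy):

import numpy as np

p = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
q = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])
f01 = np.sum((p == 0) & (q == 1))             # 2
f10 = np.sum((p == 1) & (q == 0))             # 1
f11 = np.sum((p == 1) & (q == 1))             # 0
f00 = np.sum((p == 0) & (q == 0))             # 7
smc = (f11 + f00) / (f01 + f10 + f11 + f00)   # 0.7
jaccard = f11 / (f01 + f10 + f11)             # 0.0
print(smc, jaccard)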

 Cosine Similarity:
◦ Cosine similarity is one of the most common measures of document similarity.
◦ If A and B are two document vectors, then
cos(A, B) = (A · B) / (||A|| ||B||)
◦ where A and B are the feature vectors of the two data points, "·" denotes the dot product, and "|| ||" denotes the magnitude (length) of a vector.
