
Data Mining

LAKSHMI VIVEKA KESANAPALLI


Associate Professor
Dept of CSE (Artificial Intelligence)
Pragati Engineering College

UNIT – II Part - II
Syllabus

 Data Preprocessing: Aggregation, Sampling, Dimensionality Reduction, Feature Subset Selection, Feature Creation, Discretization and Binarization, Variable Transformation, Measures of Similarity and Dissimilarity. (Tan & Vipin)
Data Preprocessing
 Data preprocessing refers to the techniques that are applied to make the data more suitable for data mining.
 It is a broad area that consists of a number of different strategies and techniques.
 They are:
 Aggregation
 Sampling
 Dimensionality reduction
 Feature subset selection
 Feature Creation
 Discretization and binarization
 Variable transformation
The goal is to improve the data mining analysis with respect to
time, cost and quality.

Aggregation
 The combining of two or more objects into a single object is called aggregation.
◦ Eg: Aggregating all the transactions of a single store into one store-wide transaction.
 Quantitative attributes, such as price, are typically aggregated by taking a sum or an average.
 A qualitative attribute, such as item, can either be omitted or summarized as the set of all the items sold at that location.
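A minimal sketch of this idea (assuming Python with pandas; the column names and values are illustrative, not from the slides):

import pandas as pd

transactions = pd.DataFrame({
    "store": ["S1", "S1", "S2", "S2", "S2"],
    "item":  ["pen", "book", "pen", "bag", "book"],
    "price": [10.0, 120.0, 12.0, 500.0, 110.0],
})
# Quantitative attribute (price) aggregated with a sum; qualitative
# attribute (item) summarized as the set of items sold at that store.
store_wide = transactions.groupby("store").agg(
    total_sales=("price", "sum"),
    items=("item", lambda s: set(s)),
)
print(store_wide)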
 Motivations for aggregation:
 Smaller data sets resulting from aggregation require less memory and processing time.
 Aggregation can act as a change of scope or scale by providing a high-level view of the data instead of a low-level view.
 The behaviour of groups of objects is often more stable than that of individual objects (less variability).
 Disadvantage: Potential loss of interesting details.
 Eg: Aggregating sales over months loses information about which day of the week has the highest sales.
Sampling
 Sampling is a commonly used approach
for selecting a subset of the data objects to
be analyzed.
 The motivation for sampling is that it is
too expensive or time consuming to
process all the data.
 The key principle of effective sampling is that the sample should be representative.
 A sample is representative if it has approximately the same properties as the original set of data.

Sampling Approaches
 There are many sampling techniques, but the most popular ones are:
1. Simple random sampling
2. Stratified sampling
Simple Random Sampling:
- It is the simplest type of sampling, in which the sample is selected at random.
- In this technique, there is an equal probability of selecting any particular item.
- There are two variations on random sampling
i. Sampling without replacement
ii. Sampling with replacement

 Sampling without replacement:
As each item is selected, it is removed from the set of all objects (the population).
 Sampling with replacement:
Objects are not removed from the population as they are selected for the sample.
In sampling with replacement, the same object can be picked more than once.
Sampling with replacement is simpler to analyze, since the probability of selecting any object remains constant during the sampling process.
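A minimal sketch of the two variations (assuming Python with NumPy, not part of the original slides):

import numpy as np

rng = np.random.default_rng(seed=42)
population = np.arange(100)                 # 100 hypothetical data objects
# Without replacement: each selected object is removed from the population,
# so no object can appear twice in the sample.
sample_without = rng.choice(population, size=10, replace=False)
# With replacement: objects stay in the population, so the same object
# may be picked more than once.
sample_with = rng.choice(population, size=10, replace=True)
print(sample_without)
print(sample_with)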

Stratified Sampling
 When the population consists of different types of objects with widely different numbers of objects, simple random sampling may fail to adequately represent the less frequent types.
 In such cases, we use "Stratified Sampling".
 In this approach, an equal number of objects is drawn from each group, even though the groups are of different sizes.
 In another variation, the number of objects drawn from each group is proportional to the size of that group.
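A minimal sketch of both variations (assuming Python with pandas; the group sizes are illustrative):

import pandas as pd

df = pd.DataFrame({
    "group": ["A"] * 900 + ["B"] * 90 + ["C"] * 10,
    "value": range(1000),
})
# Variation 1: draw an equal number of objects from every group.
equal = df.groupby("group").sample(n=5, random_state=1)
# Variation 2: draw a number proportional to the size of each group (10% here).
proportional = df.groupby("group").sample(frac=0.1, random_state=1)
print(equal["group"].value_counts())
print(proportional["group"].value_counts())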
Dimensionality Reduction

 A key benefit of dimensionality reduction is that many data mining algorithms work better if the dimensionality is lower.
 Dimensionality reduction can eliminate irrelevant features and reduce noise, and it helps avoid the "Curse of Dimensionality".

Curse of Dimensionality
 If the dimensionality increases, the data becomes increasingly sparse in the space that it occupies.
 In such cases, data analysis becomes significantly harder. This phenomenon is called the "Curse of Dimensionality".
 Another benefit of dimensionality reduction is that the model can be more understandable, because it may involve fewer attributes.
 Dimensionality reduction may allow the data to be more easily visualized.
 The amount of time and memory required by the data mining algorithm is also reduced.

Linear algebra techniques for Dimensionality
reduction:
 Principal Component Analysis (PCA):
It is a linear algebra technique for continuous attributes
that finds new attributes (principal components) that
i) are linear combinations of the original attributes.
ii) are orthogonal (perpendicular) to each other.
iii) capture the maximum amount of variation in the
data.
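A minimal sketch of PCA (assuming Python with scikit-learn; the random data is only a placeholder for a real continuous data set):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # 100 objects with 5 continuous attributes
pca = PCA(n_components=2)              # keep the 2 components with the most variance
X_reduced = pca.fit_transform(X)       # new attributes: orthogonal linear
                                       # combinations of the original attributes
print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # variance captured by each component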

 Singular Value Decomposition (SVD):
It is also a linear algebra technique that is related to PCA.
It is a method of decomposing a matrix into three other matrices:
A = U S Vᵀ
where A is an m x n matrix,
U is an m x n matrix with orthonormal columns,
S is an n x n diagonal matrix (of singular values), and
V is an n x n orthogonal matrix.
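A minimal sketch of the decomposition (assuming Python with NumPy):

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))                     # an m x n matrix (m=6, n=4)
# Economy-size SVD: U is m x n, S holds the n singular values, Vt is V transposed.
U, S, Vt = np.linalg.svd(A, full_matrices=False)
A_reconstructed = U @ np.diag(S) @ Vt           # A = U S V^T
print(np.allclose(A, A_reconstructed))          # True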

Feature Subset Selection
 It reduces the data set size by removing
redundant and irrelevant features.
 Redundant features duplicate much or all of the
information contained in one or more other
attributes.
 Eg: The purchase price of a product and the amount of sales tax (GST) paid contain much of the same information.
 Irrelevant features contain almost no useful information for the data mining task at hand.
 Eg: A student's roll number is irrelevant to the task of predicting the student's CGPA.

 Redundant and irrelevant features can reduce classification accuracy and the quality of the clusters that are found.
 The standard approaches for selecting a feature subset are:
(i) Brute force approach: Try all possible subsets of features as input to the data mining algorithm of interest and then take the subset that produces the best results.
This approach is impractical, as 'n' attributes have 2^n subsets.
(ii) Embedded approach: Feature selection occurs naturally as part of the data mining algorithm. During the operation of the data mining algorithm, the algorithm itself decides which attributes to use and which to ignore.

(iii) Filter approach: Features are selected before the data mining algorithm is run.
 Eg: Select the set of attributes whose pair-wise correlation is as low as possible (a minimal sketch follows below).
(iv) Wrapper approach: It uses the target data mining algorithm as a black box to find the best subset of attributes.
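A minimal sketch of such a filter (assuming Python with pandas; the greedy rule and the 0.9 threshold are illustrative assumptions, not from the slides):

import numpy as np
import pandas as pd

def low_correlation_filter(df, threshold=0.9):
    # Greedily keep attributes whose pairwise |correlation| with the
    # already-kept attributes stays below the threshold.
    corr = df.corr().abs()
    keep = []
    for col in df.columns:
        if all(corr.loc[col, kept] < threshold for kept in keep):
            keep.append(col)
    return keep

rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({
    "x": x,
    "x_copy": 2 * x + 0.01 * rng.normal(size=200),   # redundant attribute
    "y": rng.normal(size=200),
})
print(low_correlation_filter(df))   # ['x', 'y'] -- the redundant copy is dropped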

Flow chart of a feature subset
selection process

Feature Creation
 Feature creation is the creation of a new set of attributes that captures the important information in a data set more effectively than the original attributes.
 The number of new attributes can be smaller than the number of original attributes.
 There are three methods for creating new
attributes.
(i) Feature extraction
(ii) Mapping the data to a new space
(iii) Feature construction
(i) Feature extraction: The creation of a new
set of features from the original raw data
is known as feature extraction.
 Eg: Presence or absence of edges instead
of pixels in image processing.
 Feature extraction is highly domain
specific.
(ii) Mapping the data to a new space: A
totally different view of the data can
reveal important and interesting features.
 Eg: Fourier transforms
Wavelet transforms
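A minimal sketch of such a mapping (assuming Python with NumPy; the 7 Hz signal is an illustrative assumption): a periodic pattern that is hard to see in the raw time series becomes a single dominant feature in the frequency domain.

import numpy as np

t = np.linspace(0, 1, 500, endpoint=False)
rng = np.random.default_rng(0)
signal = np.sin(2 * np.pi * 7 * t) + 0.5 * rng.normal(size=t.size)   # 7 Hz tone + noise
spectrum = np.abs(np.fft.rfft(signal))            # magnitudes in the new (frequency) space
freqs = np.fft.rfftfreq(t.size, d=t[1] - t[0])    # the corresponding frequency axis
print(freqs[np.argmax(spectrum[1:]) + 1])         # ~7.0, the dominant frequency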

(iii) Feature Construction: Sometimes the features in the original data set have the necessary information, but not in a form suitable for the data mining algorithm.
 In this situation, one or more new features constructed out of the original features can be more useful than the original features.
 Eg: A density feature constructed from the mass and volume features, i.e.,
density = mass / volume
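A minimal sketch of this construction (assuming Python with pandas; the values are illustrative):

import pandas as pd

objects = pd.DataFrame({
    "mass":   [10.0, 4.0, 2.5],    # original feature
    "volume": [2.0, 1.0, 0.5],     # original feature
})
objects["density"] = objects["mass"] / objects["volume"]   # constructed feature
print(objects)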

Discretization and Binarization
 Transforming a continuous attribute into a categorical attribute is called discretization.
 Eg: Converting the continuous attribute age into categories such as (youth, middle-aged, senior).
 Transforming both continuous and discrete attributes into one or more binary attributes is called binarization.

 Discretization methods can be classified into two categories based on whether class information is available.
 If no class information is used for discretization, it is called "unsupervised discretization".
 Eg: Equal width approach, Equal frequency approach (a sketch follows below)
 If class information is used in discretization, it is called "supervised discretization".
 Eg: Entropy based approach
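A minimal sketch of the two unsupervised approaches (assuming Python with pandas; the ages and bin labels are illustrative):

import pandas as pd

age = pd.Series([19, 22, 25, 31, 38, 45, 52, 60, 67, 71])
# Equal width: split the range of values into 3 intervals of equal width.
equal_width = pd.cut(age, bins=3, labels=["youth", "middle-aged", "senior"])
# Equal frequency: each interval holds (roughly) the same number of objects.
equal_freq = pd.qcut(age, q=3, labels=["youth", "middle-aged", "senior"])
print(pd.DataFrame({"age": age, "equal_width": equal_width, "equal_freq": equal_freq}))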

Variable Transformation
 A variable or attribute transformation
refers to a transformation that is applied to
all the values of a variable.
 Eg: If only the magnitude of a variable is important, then the values can be transformed by taking the absolute value (|x|).
 There are two important types of variable
transformations.
(i) Simple functional transformations
(ii) Normalization or Standardization

 Simple functional transformations:
 In this type of transformation, a simple
mathematical function is applied to each
value individually.
 If x is a variable, then examples of such transformations include x^k, log x, e^x, √x, 1/x, sin x or |x|.

 Normalization or Standardization:
 The goal of normalization or
standardization is to make an entire set of
values have a particular property.
 It scales the data in such a way that all
values fall within a specified range such
as 0 to 1 or -1 to 1.
 Eg: min-max normalization
z-score normalization
Decimal scaling
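A minimal sketch of the first two of these (assuming Python with NumPy; the values are illustrative):

import numpy as np

x = np.array([2.0, 5.0, 9.0, 14.0, 20.0])
x_minmax = (x - x.min()) / (x.max() - x.min())   # min-max: values fall in [0, 1]
x_zscore = (x - x.mean()) / x.std()              # z-score: mean 0, standard deviation 1
print(x_minmax)
print(x_zscore)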

Measures of similarity and
dissimilarity (Proximity Measures)
 Similarity:
 Similarity between two objects is a
numerical measure of the degree to which
the two objects are alike.
 Similarities are usually non-negative and
are often between 0 (no similarity) and 1
(complete similarity)

 Dissimilarity:
 Dissimilarity between two objects is a
numerical measure of the degree to which
the two objects are different.
 Frequently, the term "distance" is used as
a synonym for dissimilarity.
 Dissimilarities sometimes fall in the interval [0,1], but it is also common for them to range from 0 to ∞.
 Euclidean Distance:
 The distance between two points x and y in one-, two-, three- or higher-dimensional space is given by
d(x,y) = √( Σ_{k=1..n} (x_k − y_k)² )
where n is the number of dimensions and x_k, y_k are the k-th attributes of x and y.
Eg: x = (2,3), y = (3,4)
d(x,y) = √((2−3)² + (3−4)²) = √(1 + 1) = √2 = 1.414

 Minkowski Distance:
d(x,y) = ( Σ_{k=1..n} |x_k − y_k|^r )^(1/r)
where r is a parameter and n is the number of dimensions.
- If r = 1, the distance is called the city block, Manhattan, taxicab or L1 norm distance.
- If r = 2, the distance is called the Euclidean distance or L2 norm.
- If r = ∞, the distance is called the supremum, Lmax or L∞ norm distance. This is the maximum difference between any attribute of the two objects.
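A minimal sketch of the three special cases (assuming Python with NumPy), reusing the points from the Euclidean example:

import numpy as np

def minkowski(x, y, r):
    # r = 1 -> city block, r = 2 -> Euclidean, r = np.inf -> supremum distance
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    if np.isinf(r):
        return np.abs(x - y).max()
    return (np.abs(x - y) ** r).sum() ** (1.0 / r)

x, y = (2, 3), (3, 4)
print(minkowski(x, y, 1))        # 2.0   (L1 norm)
print(minkowski(x, y, 2))        # 1.414 (L2 norm)
print(minkowski(x, y, np.inf))   # 1.0   (L-infinity norm)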

 Metric :
- If d(x,y) is the distance between two points x
and y then the following properties hold.
i) Positivity :
a) d(x,y) >= 0 for all x and y
b) d(x,y) = 0 only if x = y
ii) Symmetry :
a) d(x,y) = d(y,x)
iii) Triangle inequality :
a) d(x,z) <= d(x,y) + d(y,z) for all x, y and z
- A measure that satisfies all three properties is called a metric.

 Similarity Measures for binary data:
- Similarity measures between objects that contain only binary attributes are called similarity coefficients.
- Let x and y be two binary vectors.
- f00 = no. of attributes where x = 0 and y = 0
- f01 = no. of attributes where x = 0 and y = 1
- f10 = no. of attributes where x = 1 and y = 0
- f11 = no. of attributes where x = 1 and y = 1

Simple Matching Coefficient (SMC):
 SMC = number of matching attribute values / number of attributes
     = (f11 + f00) / (f01 + f10 + f11 + f00)
 This measure counts both presences and absences equally.
Jaccard Coefficient:
 J = number of matching presences / number of attributes not involved in 00 matches
   = f11 / (f01 + f10 + f11)

 Example:
 Given data:
p = 1 0 0 0 0 0 0 0 0 0
q = 0 0 0 0 0 0 1 0 0 1
 f01 = 2, f10 = 1, f11 = 0, f00 = 7
 Simple matching coefficient = (0 + 7) / (0 + 1 + 2 + 7) = 0.7
 Jaccard coefficient = 0 / (0 + 1 + 2) = 0
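A minimal sketch that reproduces these counts and coefficients (assuming Python with NumPy):

import numpy as np

p = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
q = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])
f01 = np.sum((p == 0) & (q == 1))             # 2
f10 = np.sum((p == 1) & (q == 0))             # 1
f11 = np.sum((p == 1) & (q == 1))             # 0
f00 = np.sum((p == 0) & (q == 0))             # 7
smc = (f11 + f00) / (f01 + f10 + f11 + f00)   # 0.7
jaccard = f11 / (f01 + f10 + f11)             # 0.0
print(smc, jaccard)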

 Cosine Similarity:
◦ Cosine similarity is one of the most common measures of document similarity.
◦ If A and B are two document vectors, then
cos(A, B) = (A · B) / (||A|| ||B||)
◦ where A and B are the feature vectors of the two data points, "·" denotes the dot product, and "|| ||" denotes the magnitude (length) of a vector.
