Prof. Heitor S Lopes
Prof. Thiago H Silva
Data Mining & Knowledge Discovery
1c - Data - Important Aspects
Data -> Knowledge
Appropriate languages help.
In this course, examples are given in:
Dataframe
Supports variables of various types and simplifies manipulation
Each column can have a different type
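For example, a minimal sketch with pandas (the column names and values below are illustrative, not from the slides):

import pandas as pd

# Each column of a DataFrame can hold a different type
df = pd.DataFrame({
    "name": ["Ana", "Bruno", "Carla"],   # strings
    "age": [23, 35, 41],                 # integers
    "height_m": [1.62, 1.80, 1.75],      # floats
    "smoker": [False, True, False],      # booleans
})

print(df.dtypes)   # one dtype per column
print(df.head())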
Dataset
Collection of data objects and their attributes
An attribute is a property of an object
Examples: eye color, temperature, etc.
Attribute is also known as variable, characteristic, feature
Attribute types
Categorical
● Nominal: they are just different names. Ex: ID number, eye color, zip code
● Ordinal: sufficient information to order. Ex: grades, height {high, medium, low}
Numeric
● Interval: the intervals between each value are equally divided (differences are significant). Ex: dates, temperature in Celsius
● Ratio: data has a natural zero point; allows comparisons of the type "x is twice as much as y". Ex: monetary amounts, weight
Attribute types
User ID in an e-mail system
– Nominal, Ordinal, or Interval?
Attribute types
Discrete attribute
● Has only a finite or countably infinite set of values
● Ex: zip codes or the set of words in a collection
● Typically represented as integer variables
Continuous attribute
● Has real numbers as attribute values
● Ex: temperature, height or weight.
● Typically represented as floating point variables
Is age continuous or discrete?
Typical and complex datasets
Matrix data
Structured Text: DNA/Protein Sequences
Complex datasets
Transactions
A special type of record, where:
● Each record (transaction) involves a set of items
● Ex: supermarket
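A minimal sketch of how such transaction data might be represented in Python (the items are illustrative):

from collections import Counter

# Each transaction is the set of items bought together
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
]

# How often each item appears across transactions
item_counts = Counter(item for t in transactions for item in t)
print(item_counts.most_common(3))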
Complex datasets
Unstructured text; Spatio-temporal data
Complex datasets
Time series; Graph
Proximity notion
Measure of similarity
● Numerical measure of how similar two data objects are
● It is larger when they are more similar
● Usually in the range [0,1]
Measure of dissimilarity
● Numerical measure of how different two objects are
● Minimal dissimilarity is usually 0
For convenience, proximity refers to similarity or dissimilarity
Euclidean distance
dist(x, y) = sqrt( Σ_{k=1..n} (x_k − y_k)² )
where n is the number of dimensions (attributes) and x_k and y_k are the k-th attributes of data objects x and y.
Standardization is necessary if the scales differ
Euclidean distance – example: points, their distance matrix, and a comparison of distance matrices (figures)
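A minimal sketch of this computation with NumPy/SciPy (the points are illustrative, not the ones from the slide figure):

import numpy as np
from scipy.spatial.distance import pdist, squareform

points = np.array([[0.0, 2.0],
                   [2.0, 0.0],
                   [3.0, 1.0]])

# Euclidean distance between the first two points
d01 = np.sqrt(np.sum((points[0] - points[1]) ** 2))
print(d01)

# Full symmetric distance matrix over all pairs of points
print(squareform(pdist(points, metric="euclidean")))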
Similarity between binary vectors
Common situation: objects p and q have only binary attributes
Compute the similarity like this:
f01 = # of attributes where p is 0 and q is 1
f10 = # of attributes where p is 1 and q is 0
f00 = # of attributes where p is 0 and q is 0
f11 = # of attributes where p is 1 and q is 1
Simple Matching (SMC) and Jaccard Coefficient (J)
SMC = number of matches “11” and “00” / number of attributes
= (f11 + f00) / (f01 + f10 + f11 + f00)
J = number of matches “11” / number of non-zero attributes
= (f11) / (f01 + f10 + f11)
SMC vs Jaccard
x = 1 0 0 0 0 0 0 0 0 0
y = 0 0 0 0 0 0 1 0 0 1
f01 = 2
f10 = 1
f00 = 7
f11 = 0
SMC = (f11 + f00) / (f01 + f10 + f11 + f00)
= (0+7) / (2+1+0+7) = 0.7
J = (f11) / (f01 + f10 + f11) = 0 / (2 + 1 + 0) = 0
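A minimal Python sketch that reproduces these counts and both coefficients:

x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]

f01 = sum(1 for a, b in zip(x, y) if (a, b) == (0, 1))
f10 = sum(1 for a, b in zip(x, y) if (a, b) == (1, 0))
f00 = sum(1 for a, b in zip(x, y) if (a, b) == (0, 0))
f11 = sum(1 for a, b in zip(x, y) if (a, b) == (1, 1))

smc = (f11 + f00) / (f01 + f10 + f11 + f00)   # 0.7
jaccard = f11 / (f01 + f10 + f11)             # 0.0
print(smc, jaccard)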
Cosine similarity
Does not take 0-0 matches into account, as in Jaccard, and works for non-binary vectors.
If d1 and d2 are numeric vectors, then
cos(d1, d2) = <d1, d2> / (||d1|| ||d2||),
where <d1, d2> indicates the dot product of the vectors d1 and d2, and ||d|| is the magnitude of the vector d.
E.g.:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
<d1, d2> = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449
cos(d1, d2) = 0.3150
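A minimal NumPy sketch that reproduces this calculation:

import numpy as np

d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0])
d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2])

cos = np.dot(d1, d2) / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos, 4))   # 0.315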
Linear correlation measure
corr(x, y) = 1 means a perfect positive correlation between the two variables.
corr(x, y) = -1 means a perfect negative correlation between the two variables, i.e., when one increases, the other always decreases.
corr(x, y) = 0 means that the two variables do not depend linearly on each other. However, there may be a non-linear dependence.
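A minimal NumPy sketch (the vectors are illustrative):

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = 2 * x + 3                        # perfectly linear in x

print(np.corrcoef(x, y)[0, 1])       # 1.0  (perfect positive correlation)
print(np.corrcoef(x, -y)[0, 1])      # -1.0 (perfect negative correlation)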
What about categorical data?
A = (-3, -2, -1, 0, 1, 2, 3)
B = (a, a, b, a, a, b, b)
What about categorical data?
A = (-3, -2, -1, 0, 1, 2, 3)
B = (a, a, b, a, a, b, b)
Codes: a = 0, b = 1
B = (0, 0, 1, 0, 0, 1, 1)
Binarization
Maps a continuous or categorical attribute to one or more binary variables
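For example, a minimal sketch of one-hot encoding with pandas (the column and category names are illustrative):

import pandas as pd

df = pd.DataFrame({"eye_color": ["brown", "blue", "green", "blue"]})

# Each category becomes one binary (0/1) column
binary = pd.get_dummies(df, columns=["eye_color"])
print(binary)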
Normalization (z-score)
Also known as standardization
z = (x − μ) / σ
where μ is the mean (average) and σ is the standard deviation.
Standardizes features so that they are centered around 0 with a standard deviation of 1.
This is a common requirement for many machine learning algorithms.
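A minimal sketch, using the data from the "Normalization - Example" slide below, both by hand and with scikit-learn's StandardScaler:

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[ 3.9, 5.0, 3000.0],
              [ 5.0, 5.5, 3500.0],
              [10.0, 6.0, 3500.0]])

# By hand: z = (x - mean) / std, column by column
z_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# With scikit-learn (same result)
z_sklearn = StandardScaler().fit_transform(X)
print(np.allclose(z_manual, z_sklearn))   # True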
Normalization (MIN-MAX)
Typically: x' = (x − min) / (max − min)
In this approach, data is scaled to a fixed range, usually from 0 to 1.
Normalization - Example
[[ 3.9 5. 3000. ]
[ 5. 5.5 3500. ]
[ 10. 6. 3500. ]]
Distances between non-normalized objects
[[  0.      500.0014 500.0382]
 [500.0014    0.       5.0249]
 [500.0382    5.0249   0.    ]]
Distances between normalized objects
[[0.         1.13248317 1.73205081]
 [1.13248317 0.         0.96013   ]
 [1.73205081 0.96013    0.        ]]
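These distance matrices can be reproduced with the following sketch (assuming the example scales each column to [0, 1] with min-max normalization and uses Euclidean distances):

import numpy as np
from sklearn.preprocessing import MinMaxScaler
from scipy.spatial.distance import pdist, squareform

X = np.array([[ 3.9, 5.0, 3000.0],
              [ 5.0, 5.5, 3500.0],
              [10.0, 6.0, 3500.0]])

print(squareform(pdist(X)))               # distances between the raw objects

X_norm = MinMaxScaler().fit_transform(X)  # each column scaled to [0, 1]
print(squareform(pdist(X_norm)))          # distances between the normalized objects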
References
Tan, P. N., Steinbach, M., & Kumar, V. (2016). Introduction to data mining.
Pearson Education India.
Thanks to Professors Josh Starmer, Yi Zhang, and Vincent Spruyt for some of the images used.
Official documentation of the scikit-learn library: scikit-learn.org