DM Day2 DataUnderstanding MS S25
Spring 2025
Section A
Email: [email protected]
Textbook(s)/Supplementary Readings
Data Mining: Concepts and Techniques,
J. Han, J. Pei, and H. Tong, 4th Edition, Morgan Kaufmann Publishers, 2023.
Data Mining: Concepts and Techniques,
J. Han, M. Kamber, and J. Pei, Third Edition, Morgan Kaufmann Publishers, 2012.
Reference:
Introduction to Data Mining,
P.-N. Tan, M. Steinbach, and V. Kumar, Addison-Wesley, 2009.
Data Mining: Practical Machine Learning Tools and Techniques,
Ian H. Witten, Eibe Frank, and Mark A. Hall, Third Edition, Morgan Kaufmann Publishers, 2011.
Final Exam: will cover the entire course. At least 70% of the material will be from after the midterm.
What Is Data Mining?
© Prentice Hall
Knowledge Discovery (KDD) Process
This is a view from the typical database systems and data warehousing communities.
Data mining plays an essential role in the knowledge discovery process.
[Figure: Databases → Data Cleaning → Data Integration → Task-relevant Data → Data Mining → Pattern Evaluation]
Data Mining Functionalities
Characterization
Discrimination
Association and Correlation Analysis (Ch. 6)
Classification (Ch. 8)
Regression
Clustering (Ch. 10)
Outlier Analysis
Data Characterization
Summarize the characteristics of customers
who spend more than $5000 a year at
AllElectronics.
Association rule
X ⇒ Y [support, confidence]
Butter, Bread ⇒ Milk [40%, 100%]
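The support and confidence figures above can be computed directly from a transaction database. A minimal sketch, using a small hypothetical set of transactions (invented for illustration, not from the slides) chosen so the rule Butter, Bread ⇒ Milk comes out at [40%, 100%]:

```python
# Support/confidence of the rule {Butter, Bread} => {Milk} over a small
# hypothetical transaction database.
transactions = [
    {"Butter", "Bread", "Milk"},
    {"Butter", "Bread", "Milk"},
    {"Milk"},
    {"Bread"},
    {"Butter", "Eggs"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """Of the transactions containing lhs, the fraction also containing rhs."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

X, Y = {"Butter", "Bread"}, {"Milk"}
print(support(X | Y, transactions))    # 0.4 -> 40% support
print(confidence(X, Y, transactions))  # 1.0 -> 100% confidence
```

Support is the fraction of all transactions containing both sides of the rule; confidence is support(X ∪ Y) divided by support(X).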
Classification
Classification and label prediction
Construct models (functions) based on some training
examples
Describe and distinguish classes or concepts for future
prediction
Predict some unknown class labels
Typical methods
Decision trees, naïve Bayesian classification, support
vector machines, neural networks, rule-based
classification, pattern-based classification, logistic
regression, …
Typical applications:
Credit card fraud detection, direct marketing, disease
identification, …
Regression
Whereas classification predicts categorical (discrete, unordered) labels, regression models continuous-valued functions.
Numeric Prediction
Cluster Analysis
Outlier analysis
Outlier: A data object that does not comply with
the general behavior of the data
Noise or exception? ― One person’s garbage
could be another person’s treasure
Methods: by-product of clustering or regression analysis, …
Useful in fraud detection, rare events analysis
Homework
Read chapter 1 of your Data Mining textbook.
Data Objects
Data Object
Represents an entity
Sales database
Customers
Store items
Sales
Data Objects
Data Object
Represents an entity
Medical database
Patients
Doctors
…
Data Objects
Data Object
Represents an entity
University database
Students
Professors
Courses
Data Objects
Data objects can also be referred to as
Samples
Examples
Instances
Data points
Objects
Tuples
Data Objects
Data Object
Represents an entity
Attributes
Used to describe Data objects
Attribute
An attribute is a data field, representing a
characteristic or feature of a data object
A.k.a.
Dimension (Data Warehouse)
Feature (Machine Learning)
Variable (Statistics)
Attribute
Customer (object)
Customer ID
Name
Address
Student (object)
???
???
…
Observations
Observed values for a given attribute are
known as observations
Student Object
Attribute   Observation (Student 1)   Observation (Student 2)
Name        Mushtaq                   Hamna
ID          F2012065219               S2011599031
CGPA        3.84                      3.77
Attribute Vector (or Feature Vector)
A set of attributes used to describe a
given object
Customer Object
Customer ID, Name, Address
Student Object
Name, ID, CGPA
Data, Data; Who are You?
I am Quality
I am Quantity
Types of Attributes
Nominal
Binary
Ordinal
Numeric
Nominal Attributes
Relating to Names
Category
Undergraduate, Graduate
Code
065, 266, 105, 288
State
Present, Absent
Binary Attributes
A nominal attribute with only two
categories or states: 0 or 1
A.k.a.
Boolean Attributes
if the two states correspond to true and false
Binary Attributes
Symmetric: both states are equally important (e.g., gender)
Asymmetric: the two states are not equally important (e.g., a medical test's positive vs. negative outcome)
Ordinal Attributes
Values have a meaningful order or ranking, but the magnitude between successive values is not known
Drink Size:
small, medium, and large
Grade:
A+, A, A-, B+, and so on
Ordinal attributes are often used in surveys for ratings
Numeric Attributes
A numeric attribute is quantitative
i.e. It is a measurable quantity
Represented in integer or real values
Interval-scaled: measured on a scale of equal-size units; no inherent zero-point (e.g., temperature in °C)
Ratio-scaled: has an inherent zero-point, so ratios are meaningful (e.g., height, weight, salary)
https://quizzma.com/levels-of-measurement-quiz/
Discrete vs. Continuous Attributes
Discrete Attribute
Has only a finite or countably infinite set of values
E.g., zip codes, profession, or the set of words in a
collection of documents
Sometimes, represented as integer variables
Note: Binary attributes are a special case of
discrete attributes
Continuous Attribute
Has real numbers as attribute values
E.g., temperature, height, or weight
Practically, real values can only be measured and
represented using a finite number of digits
Continuous attributes are typically represented as
floating-point variables
Data set
A data set is a collection of numbers or
values that relate to a particular subject or
task.
The test scores of each student in a
particular class
The transactions at a store
Mean, Median, and Mode
Let x1, x2, …, xN be a set of N values or observations for some numeric attribute X, like salary.
Mean: the average of the N values,
  mean = (x1 + x2 + … + xN) / N
Median: the middle value if N is odd, or the average of the middle two values otherwise.
For grouped data, the median can be estimated by interpolation:
  median ≈ L1 + ((N/2 − (Σ freq)_l) / freq_median) × width
where L1 is the lower boundary of the median interval, (Σ freq)_l is the sum of frequencies of the intervals below it, freq_median is the frequency of the median interval, and width is the interval width.
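For the example data used throughout these slides, the mean and median can be checked with the standard library. A minimal sketch (note `statistics.median` averages the two middle values when N is even, matching the definition above):

```python
# Mean and median of the example data (N = 12, so the median is the
# average of the 6th and 7th sorted values: (52 + 56) / 2 = 54).
import statistics

data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

mean = sum(data) / len(data)
median = statistics.median(data)

print(mean)    # 58.0
print(median)  # 54.0
```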
Mode
Value that occurs most frequently
Bimodal Data
Two values occur most frequently (with the same frequency)
Example 2.6 Data: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110
The modes are 52 and 70 (each occurs twice).
Mode
In general, a data set with two or more
modes is multimodal
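All modes of a (possibly multimodal) data set can be found with `statistics.multimode` (Python 3.8+), which returns every value tied for the highest frequency:

```python
# Finding all modes of the example data; it is bimodal with modes 52 and 70.
import statistics

data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
modes = statistics.multimode(data)  # every value with maximal frequency
print(modes)  # [52, 70]
```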
Mode
For grouped or moderately skewed data, the mode can be approximated by the empirical relation:
  mean − mode ≈ 3 × (mean − median)
Midrange
The midrange, the average of the largest and smallest values in the set, can also be used to assess the central tendency of a numeric data set
Dispersion
Range, Quartiles, and Interquartile Range
Variance and Standard Deviation
Choose the mobile, M1 or M2?
M1: Average of customer ratings: 3
M2: Average of customer ratings: 3
M2 customer ratings: 2, 3, 3, 3, 3, 3, 3, 4
The averages are equal, so central tendency alone cannot decide; we must look at the dispersion of the ratings.
Range: 110-30 = 80
Percent vs Percentile?
7/10
70 percent
100th percentile: 7 is the point below which (nearly) all, 99.99%, of the data falls; a percentile describes standing relative to the data, not the score itself
Deciles
The 10-quantiles: the nine data points that split the data distribution into ten equal parts
[Figure: distribution marked with Minimum Value, Q1, Median, Q3, Maximum Value]
Example: Statistical Description of
Data – 5 Number Summary
Data: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70,
70, 110
Q1 = 47
Q2 = (52 + 56) / 2 = 54
Q3 = 63
IQR = 16
Min = ?
Max = ?
Normal Data (within the outlier fences)
Q1 − (1.5)(IQR) = 47 − 24 = 23
Q3 + (1.5)(IQR) = 63 + 24 = 87
Five Number Summary
(min, Q1, Q2, Q3, max) = (30, 47, 54, 63, 110)
The boxplot whiskers extend only to 30 and 70, since 110 lies beyond the upper fence of 87 and is plotted as an outlier.
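The whole five-number summary, IQR, fences, and outliers can be reproduced in a few lines. A sketch, assuming the slides' quartile convention that Qk is the value at position ceil(k·N/4) of the sorted data (other conventions interpolate and give slightly different Q1/Q3):

```python
# Five-number summary and 1.5*IQR outlier fences for the example data.
import math

data = sorted([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])
N = len(data)

def quartile(k):
    # Assumed convention: value at position ceil(k*N/4) (1-indexed).
    return data[math.ceil(k * N / 4) - 1]

q1, q3 = quartile(1), quartile(3)
median = (data[N // 2 - 1] + data[N // 2]) / 2  # N is even here
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]
whisker_lo = min(x for x in data if x >= lower_fence)
whisker_hi = max(x for x in data if x <= upper_fence)

print((data[0], q1, median, q3, data[-1]))  # (30, 47, 54.0, 63, 110)
print(iqr, lower_fence, upper_fence)        # 16 23.0 87.0
print(outliers, whisker_lo, whisker_hi)     # [110] 30 70
```

This also shows why the whiskers stop at 30 and 70: 110 falls outside the upper fence of 87.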
Measuring the Dispersion of Data
Boxplot: ends of the box are the quartiles; the median is marked; whiskers are added; and outliers are plotted individually
Variance (population) of N values with mean mu:
  sigma^2 = (1/N) * sum_i (x_i - mu)^2 = (1/N) * sum_i x_i^2 - mu^2
Standard deviation sigma is the square root of the variance
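Both forms of the population variance give the same result, which is easy to verify on the example data:

```python
# Population variance and standard deviation for the example data, checking
# the identity sigma^2 = (1/N)*sum((x - mu)^2) = (1/N)*sum(x^2) - mu^2.
import math
import statistics

data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
N = len(data)
mu = sum(data) / N

var_definition = sum((x - mu) ** 2 for x in data) / N
var_shortcut = sum(x * x for x in data) / N - mu ** 2

print(round(var_definition, 2))             # 379.17
print(round(math.sqrt(var_definition), 2))  # 19.47 (standard deviation)
print(abs(var_definition - var_shortcut) < 1e-9)                # True
print(abs(var_definition - statistics.pvariance(data)) < 1e-9)  # True
```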
Boxplot Analysis
Boxplot
Data is represented with a box
The ends of the box are at the first and third
quartiles, i.e., the height of the box is IQR
The median is marked by a line within the box
Whiskers: two lines outside the box extended to
Minimum and Maximum
Outliers: points beyond a specified outlier
threshold, plotted individually
Boxplot in Matlab
>> d = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110];
>> boxplot(d);
Histogram Analysis
Histogram: graph display of tabulated frequencies, shown as bars
It shows what proportion of cases fall into each of several categories
The categories are usually specified as non-overlapping intervals of some variable. The categories (bars) must be adjacent
[Figure: histogram with frequency on the y-axis (0 to 40) over intervals from 10000 to 90000]
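Tabulating the frequencies behind a histogram's bars is just counting values per adjacent, non-overlapping interval. A stdlib-only sketch; the bin edges here are chosen for the example data (not the slides' price intervals), and by a common convention the last bin is closed on the right:

```python
# Count how many values fall into each adjacent interval.
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
edges = [30, 50, 70, 90, 110]  # bins: [30,50), [50,70), [70,90), [90,110]

def histogram(values, edges):
    counts = [0] * (len(edges) - 1)
    for v in values:
        for b in range(len(edges) - 1):
            last = (b == len(edges) - 2)
            if edges[b] <= v < edges[b + 1] or (last and v == edges[-1]):
                counts[b] += 1
                break
    return counts

print(histogram(data, edges))  # [3, 6, 2, 1]
```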
Histograms Often Tell More than Boxplots
1, 1, 1, 3, 5, 5, 5
1, 3, 3, 3, 3, 3, 5
Properties of Normal Distribution Curve
From μ−σ to μ+σ: contains about 68% of the measurements
From μ−2σ to μ+2σ: contains about 95% of the measurements
From μ−3σ to μ+3σ: contains about 99.7% of the measurements
Quantile Plot
Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences)
Plots quantile information
For data x_i sorted in increasing order, f_i indicates that approximately 100·f_i% of the data are below or equal to the value x_i
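The (f_i, x_i) pairs a quantile plot draws can be generated directly. A sketch assuming the common convention f_i = (i − 0.5)/N (some texts use i/(N+1) instead):

```python
# Quantile-plot coordinates: f_i = (i - 0.5) / N paired with sorted x_i,
# so roughly 100*f_i % of the data lies at or below x_i.
data = sorted([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])
N = len(data)
f = [(i - 0.5) / N for i in range(1, N + 1)]

points = list(zip(f, data))  # (f_i, x_i) pairs to plot
print(round(points[0][0], 4), points[0][1])    # 0.0417 30
print(round(points[-1][0], 4), points[-1][1])  # 0.9583 110
```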
Quantile-Quantile (Q-Q) Plot
Graphs the quantiles of one univariate distribution against the corresponding quantiles of another
View: Is there a shift in going from one distribution to another?
Example shows unit price of items sold at Branch 1 vs. Branch 2 for each quantile (Q1, Q2, and Q3 marked). Unit prices of items sold at Branch 1 tend to be lower than those at Branch 2.
Scatter plot
Provides a first look at bivariate data to see
clusters of points, outliers, etc
Each pair of values is treated as a pair of
coordinates and plotted as points in the plane
Example: Ice Cream Sales vs Temperature
Temperature °C   Ice Cream Sales
14.2             $215
16.4             $325
[Scatter plot: Ice Cream Sales vs Temperature; sales rise with temperature]
Sales and temperature are positively correlated.
A scatter plot can also reveal a change in direction: in the slides' second example, the right half of the plot is negatively correlated.
Uncorrelated Data
Chapter 2: Getting to Know Your Data
Data Objects and Attribute Types
Data Visualization
Summary
Data Visualization
Pixel-Oriented Visualization Techniques
For a data set of m dimensions, create m windows on the screen, one for
each dimension
The m dimension values of a record are mapped to m pixels at the
corresponding positions in the windows
The colors of the pixels reflect the corresponding values
IRIS Dataset
https://archive.ics.uci.edu/ml/datasets/iris
http://support.sas.com/documentation/
Icon-Based Visualization Techniques
Visualization of the data values as features of icons
Typical visualization methods
Chernoff Faces
Stick Figures
General techniques
Shape coding: Use shape to represent certain
information encoding
Color icons: Use color icons to encode more
information
Chernoff Faces
A way to display variables on a two-dimensional surface, e.g., let x be
eyebrow slant, y be eye size, z be nose length, etc.
The figure shows faces produced using 10 characteristics (head eccentricity, eye size, eye spacing, eye eccentricity, pupil size, eyebrow slant, nose size, mouth shape, mouth size, and mouth opening); each is assigned one of 10 possible values. Generated using Mathematica (S. Dickson).
Hierarchical Visualization Techniques
Visualization of the data using a
hierarchical partitioning into subspaces
Methods
Worlds-within-Worlds
Tree-Map
Worlds-within-Worlds
[Figure: a sample data table over attributes x1, x2, x3, x4; an inner plot of x3 vs. x1 is opened at a fixed point (x2, x4) = (2, 3) of the outer world]
Tree-Map
Screen-filling method which uses a hierarchical partitioning of
the screen into regions depending on the attribute values
The x- and y-dimension of the screen are partitioned
alternately according to the attribute values (classes)
https://support.office.com/
Visualizing Complex Data and Relations
Visualizing non-numerical data: text and social networks
Tag cloud: visualizing user-generated tags (Try https://www.wordclouds.com/)
Data Similarity and Dissimilarity
In Data Mining, we often need ways to assess how alike or unalike objects are in comparison to one another
Data Similarity and Dissimilarity
Dis(similarity) computation is important for
many Data Mining Tasks, e.g.,
Clustering: “... objects within a cluster are
similar to one another and dissimilar to the
objects in other clusters”
Outlier Analysis: “Outliers as objects that are highly dissimilar to others”
Example: students plotted as points, e.g., A(2, 4) and B(4, 7)
Can you suggest two teams of students from the above data, based on similarity?
Measures of Proximity
Similarity Measures
How much alike objects are in comparison to
one another
Dissimilarity Measures
How much unalike objects are in comparison
to one another
Data Matrix
Object-by-Attribute Structure
A.k.a. Two-Mode Matrix
Rows
Objects
Columns
Attributes
NxP
N is the number of objects
P is the number of attributes
Dissimilarity Matrix
Object-by-Object Structure
Value at (i, j) position represents the
measured dissimilarity or “difference”
between objects i and j
Size? N x N for N objects
The matrix is symmetric, with d(i, i) = 0 and entries such as d(1, 2)
m: number of matches (i.e., the number of attributes for which objects i and j are in the same state)
p: total number of attributes describing the objects
Proximity Measures for Nominal Attributes
Student   City
Ali       Lahore
Bilal     Karachi
Javed     Multan
Aslam     Lahore
d(2,1) = ?
d(3,1) = ?
d(3,2) = ?
d(4,1) = ?
d(4,2) = ?
d(4,3) = ?
Hint:
Object ID P1 P2 P3 P4
1 A X C L
2 B Y G W
3 A Y K A
4 A Y C A
p = ?
p = 4
Proximity Measures for Nominal Attributes
Object ID P1 P2 P3 P4
1 A X C L
2 B Y G W
3 A Y K A
4 A Y C A
d(2,1) = 1 sim(2,1) = 0
d(3,1) = 0.75 sim(3,1) = 0.25
d(3,2) = 0.75 sim(3,2) = 0.25
d(4,1) = 0.5 sim(4,1) = 0.5
d(4,2) = 0.75 sim(4,2) = 0.25
d(4,3) = 0.25 sim(4,3) = 0.75
Proximity Measure for Nominal Attributes
d(i, j) = (p − m) / p
m: # of matches, p: total # of variables
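The formula d(i, j) = (p − m)/p is a one-liner over the four-object example table above, and reproduces the slide's answers:

```python
# Nominal dissimilarity d(i, j) = (p - m) / p for the example objects.
objects = {
    1: ["A", "X", "C", "L"],
    2: ["B", "Y", "G", "W"],
    3: ["A", "Y", "K", "A"],
    4: ["A", "Y", "C", "A"],
}

def d(i, j):
    a, b = objects[i], objects[j]
    p = len(a)                             # total number of attributes
    m = sum(x == y for x, y in zip(a, b))  # number of matching states
    return (p - m) / p

print(d(2, 1), d(3, 1), d(4, 1), d(4, 3))  # 1.0 0.75 0.5 0.25
print(1 - d(4, 3))                         # sim(4,3) = 0.75
```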
Activity
Person   Hair_color   Marital_status   Profession
Ali      Black        Single           Engineer
Bilal    Brown        Single           Engineer
Javed    Black        Married          Banker
q: number of attributes that equal 1 for both objects i and j
r: number of attributes that equal 1 for object i but equal 0 for object j
s: number of attributes that equal 0 for object i but equal 1 for object j
t: number of attributes that equal 0 for both objects i and j
p: total number of attributes (p = q + r + s + t)
Proximity Measures for Binary Attributes
A B C D E
Ali 1 0 0 1 0
Ahmed 1 1 0 0 1
Symmetric binary dissimilarity: d(i, j) = (r + s) / (q + r + s + t)
Asymmetric binary dissimilarity: d(i, j) = (r + s) / (q + r + s)
Asymmetric binary similarity: sim(i, j) = q / (q + r + s) = 1 − d(i, j)
A.k.a. the Jaccard Coefficient
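Counting q, r, s, t for the Ali/Ahmed rows above and plugging them into these formulas:

```python
# q, r, s, t and the binary (dis)similarity measures for Ali vs. Ahmed.
ali   = [1, 0, 0, 1, 0]  # attributes A..E
ahmed = [1, 1, 0, 0, 1]

q = sum(a == 1 and b == 1 for a, b in zip(ali, ahmed))  # 1-1 matches
r = sum(a == 1 and b == 0 for a, b in zip(ali, ahmed))  # 1 for i, 0 for j
s = sum(a == 0 and b == 1 for a, b in zip(ali, ahmed))  # 0 for i, 1 for j
t = sum(a == 0 and b == 0 for a, b in zip(ali, ahmed))  # 0-0 matches

symmetric_d  = (r + s) / (q + r + s + t)  # both states equally important
asymmetric_d = (r + s) / (q + r + s)      # 0-0 matches carry no information
jaccard_sim  = q / (q + r + s)            # = 1 - asymmetric_d

print(q, r, s, t)                              # 1 1 2 1
print(symmetric_d, asymmetric_d, jaccard_sim)  # 0.6 0.75 0.25
```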
Euclidean Distance
Manhattan Distance
Supremum Distance
Minkowski Distance
Euclidean Distance
Straight Line Distance
i = (2, 4, 8)
j = (4, 3, 4)
d(i, j) = sqrt((2−4)² + (4−3)² + (8−4)²) = sqrt(4 + 1 + 16) = sqrt(21) ≈ 4.58
Manhattan Distance
Given two objects i & j described by p
numeric attributes
i = (xi1, xi2, … , xip)
j = (xj1, xj2, … , xjp)
i = (2, 4, 8)
j = (4, 3, 4)
Manhattan D = |2−4| + |4−3| + |8−4| = 2 + 1 + 4 = 7
Manhattan Distance
City Block Distance
https://www.google.com/maps/place/Manhattan,+New+York,+NY,+USA/
Supremum Distance
Euclidean, Manhattan, and Supremum distances
Supremum distance
= max ( |1-3|, |2-5|)
= max ( 2, 3)
=3
Minkowski Distance
A generalization of the Euclidean and Manhattan distances:
  d(i, j) = (sum_k |x_ik − x_jk|^h)^(1/h)
Manhattan Distance: h = 1
Euclidean Distance: h = 2
Supremum Distance: the limit as h → ∞
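One function covers all of the worked examples above by varying h:

```python
# Minkowski distance d(i,j) = (sum_k |x_ik - x_jk|^h)^(1/h); h = 1 gives
# Manhattan, h = 2 gives Euclidean, and the supremum distance is the
# h -> infinity limit (the largest per-attribute difference).
def minkowski(x, y, h):
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)

def supremum(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

i, j = (2, 4, 8), (4, 3, 4)
print(minkowski(i, j, 1))            # 7.0  (Manhattan)
print(round(minkowski(i, j, 2), 2))  # 4.58 (Euclidean, sqrt(21))
print(supremum((1, 2), (3, 5)))      # 3    (Supremum)
```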