DM Day2 DataUnderstanding MS S25

The document outlines the course structure for a Data Mining class taught by Dr. Malik Tahir Hassan at the University of Management for Spring 2025, detailing the schedule, grading policy, and classroom rules. It covers essential topics such as data understanding, data mining functionalities, and various data types, along with recommended textbooks and tools. The course emphasizes attendance, participation, and adherence to academic integrity policies.


Data Mining (DM)

Spring 2025
Section A

Day 2: Data Understanding

Dr. Malik Tahir Hassan, University of Management and Technology


Lecture Schedule
• Alternate-weekend schedule for the course
• 6 teaching days in total, followed by the final exam
• 7.5 contact hours of teaching per day
Instructor
Dr. Malik Tahir Hassan
Associate Professor
 Office: SDT-404
 Counseling Hours:
 Monday – Friday (2:00 pm – 4:00 pm),

 Email: [email protected]
Textbook(s)/Supplementary Readings
 Data Mining: Concepts and Techniques,
   J. Han, J. Pei, and H. Tong, 4th Edition, Morgan
   Kaufmann Publishers, 2023.
 Data Mining: Concepts and Techniques,
   J. Han, M. Kamber, and J. Pei, 3rd Edition, Morgan
   Kaufmann Publishers, 2012.

Reference:
 Introduction to Data Mining,
   P.-N. Tan, M. Steinbach, and V. Kumar,
   Addison-Wesley, 2009.
 Data Mining: Practical Machine Learning Tools and Techniques,
   Ian H. Witten, Eibe Frank, and Mark A. Hall,
   3rd Edition, Morgan Kaufmann Publishers, 2011.

 Tools and Technologies


Weka
C++ or Java, Matlab, Python
Grading Policy
Instrument        Description                                        Weight
Class Exercises   In-class exercises and evaluation                  10%
Assignments/      Assigned during important stages of the course     10%
Project           to apply and practice the learnt concepts
Quizzes           In-class (un)announced 15-minute tests             15%
Mid-Term Exam     A single 90-minute exam from the material covered  25%
Final Exam        Will cover the entire course; at least 70% of      40%
                  the material would be post-midterm

Late Submission Policy: Late submissions are not allowed.

Classroom Policy
1. Attendance is very important: 80% is required,
100% is recommended.
2. Keep your mobile phones switched off.
3. Females sit on the right while facing the whiteboard.
4. Quizzes can be announced or unannounced. One
quiz out of 4 or 5 will be dropped. No
retakes for quizzes.
5. Plagiarism and cheating cases will be
reported to the Disciplinary Committee.
6. Moodle will be the resource-sharing medium;
keep checking the Moodle page and your email
regularly.
Course Outline, Plan
 Chapters of the textbook
Introduction (Ch. 1) Day 1
Data Understanding (Ch. 2) Day 2
Data Preparation (a.k.a. Data Pre-processing) (Ch. 3) Day 3
Frequent Patterns and Association Mining (Ch. 6) Day 4
 Mid-Term Exam on Day 4
Classification (Ch. 8) Day 5
Clustering (Ch. 10) Day 6
Final Exam

 See the document


Available on Moodle course page
Chapter 1. Introduction
 Why Data Mining?

 What Is Data Mining?

 What Kinds of Data Can Be Mined?

 What Kinds of Patterns Can Be Mined?

 What Kinds of Technologies Are Used?

 What Kinds of Applications Are Targeted?

 Major Issues in Data Mining

 A Brief History of Data Mining and Data Mining Society

 Summary

What Is Data Mining?

 Data mining (knowledge discovery from data)
   Extraction of interesting (non-trivial, implicit, previously
   unknown, and potentially useful) patterns or knowledge
   from huge amounts of data
   It involves analysing data to discover hidden patterns,
   correlations, anomalies, or relationships that can be used to
   make informed decisions, predictions, or recommendations.
 Data mining: a misnomer?
 Alternative names
   Knowledge discovery (mining) in databases (KDD), knowledge
   extraction, data/pattern analysis, data archeology, data
   dredging, information harvesting, business intelligence, etc.
Data Mining Models and Tasks

(Figure © Prentice Hall)
Knowledge Discovery (KDD) Process
 This is a view from typical database systems and data
   warehousing communities
 Data mining plays an essential role in the knowledge
   discovery process
 (Figure: the KDD pipeline: Databases → Data Cleaning → Data
   Integration → Data Warehouse → Selection → Task-relevant
   Data → Data Mining → Pattern Evaluation)
Data Mining Functionalities
Characterization
Discrimination
Association and Correlation Analysis (Ch. 6)
Classification (Ch. 8)
Regression
Clustering (Ch. 10)
Outlier Analysis
Data Characterization
Summarize the characteristics of customers
who spend more than $5000 a year at
AllElectronics.

The result is a general profile of these


customers, such as that they are
40 to 50 years old,
employed, and
have excellent credit ratings.
Data Discrimination
Compare two groups of customers

 Those who shop for computer products regularly
   (e.g., more than twice a month)
 Those who rarely shop for such products
   (e.g., less than three times a year)
Data Discrimination
The resulting description provides a general comparative
profile of these customers, such as that
 80% of the customers who frequently purchase computer
   products are between 20 and 40 years old and have a
   university education
 Whereas 60% of the customers who infrequently buy such
   products are either seniors or youths, and have no
   university degree
Frequent Patterns, Association and
Correlation Analysis
 Frequent patterns (or frequent itemsets)
What items are frequently purchased together
on DARAZ?

Association rule
   X ⇒ Y [support, confidence]
   Butter, Bread ⇒ Milk [40%, 100%]

 How to mine such patterns and rules efficiently in
   large datasets?
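To make support and confidence concrete, here is a minimal Python sketch over a toy transaction list (the data is illustrative, not from the slides): support is the fraction of all transactions containing X ∪ Y, and confidence is the fraction of X-transactions that also contain Y.

```python
# Hedged sketch: support and confidence of an association rule X => Y
# over a hand-made toy transaction list.
def support(transactions, itemset):
    """Fraction of transactions containing every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, x, y):
    """support(X union Y) / support(X)."""
    return support(transactions, set(x) | set(y)) / support(transactions, x)

transactions = [
    {"butter", "bread", "milk"},
    {"butter", "bread", "milk"},
    {"bread", "eggs"},
    {"milk", "eggs"},
    {"butter", "bread", "milk", "eggs"},
]
sup = support(transactions, {"butter", "bread", "milk"})        # 3/5 = 0.6
conf = confidence(transactions, {"butter", "bread"}, {"milk"})  # 3/3 = 1.0
```

Every transaction containing butter and bread here also contains milk, so the rule holds with 60% support and 100% confidence, mirroring the style of the slide's Butter, Bread ⇒ Milk example.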

Classification
 Classification and label prediction
Construct models (functions) based on some training
examples
Describe and distinguish classes or concepts for future
prediction
Predict some unknown class labels
 Typical methods
Decision trees, naïve Bayesian classification, support
vector machines, neural networks, rule-based
classification, pattern-based classification, logistic
regression, …
 Typical applications:
Credit card fraud detection, direct marketing, disease identification, …
Regression
 While classification predicts categorical (discrete,
   unordered) labels, regression models
   continuous-valued functions.
 A.k.a. Numeric Prediction
Cluster Analysis

 Unsupervised learning (i.e., Class label is unknown)


 Group data to form new categories (i.e., clusters),
e.g., cluster houses to find distribution patterns
 Principle: Maximizing intra-class similarity &
minimizing interclass similarity
 Many methods and applications
Market segmentation
Recommender systems
Social network analysis
Education
Outlier Analysis

 Outlier analysis
Outlier: A data object that does not comply with
the general behavior of the data
Noise or exception? ― One person’s garbage
could be another person’s treasure
Methods: by-product of clustering or regression
analysis, …
Useful in fraud detection, rare events analysis

Homework
Read chapter 1 of your Data Mining textbook.

Identify similarities and differences in the following


Database and Datawarehouse
Discrimination and Classification
Characterization and Clustering
Classification and Regression
Classification and Clustering
Descriptive and predictive data mining
Data Mining and Machine Learning
Supervised, Unsupervised and Semi-supervised
Learning
Chapter 2:
Getting to Know your Data
Data Types, Data Statistics, Data Visualization
Chapter 2: Getting to Know Your
Data
Data Objects and Attribute Types

Basic Statistical Descriptions of Data

Data Visualization

Measuring Data Similarity and Dissimilarity

Summary

Data Objects
Data Object
Represents an entity

Sales database
Customers
Store items
Sales
Data Objects
Data Object
Represents an entity

Medical database
Patients
Doctors
…
Data Objects
Data Object
Represents an entity

University database
Students
Professors
Courses
Data Objects
Data objects can also be referred to as
Samples
Examples
Instances
Data points
Objects
Tuples
Data Objects
Data Object
Represents an entity

Attributes
Used to describe Data objects
Attribute
An attribute is a data field, representing a
characteristic or feature of a data object

A.k.a.
Dimension (Data Warehouse)
Feature (Machine Learning)
Variable (Statistics)
Attribute
Customer (object)
Customer ID
Name
Address

Student (object)
???
???
…
Observations
Observed values for a given attribute are
known as observations
Student Object

Attributes   Observations
Name         Mushtaq        Hamna
ID           F2012065219    S2011599031
CGPA         3.84           3.77
Attribute Vector (or Feature Vector)
A set of attributes used to describe a
given object

Customer Object
Customer ID, Name, Address

Student Object
Name, ID, CGPA
Data, Data; Who are You?
Data, Data; Who are You?

I am Quality

I am Quantity
Data, Data; Who are You?
Types of Attributes

Nominal

Binary

Ordinal

Numeric
Nominal Attributes
Relating to Names

Values of a nominal attribute are


Symbols, or
names of things
e.g. category, code, or state

A.k.a. Categorical Attributes


Nominal Attributes
Values of a nominal attribute are
Symbols, or
names of things
e.g. category, code, or state

Category
 Undergraduate, Graduate

Code
 065, 266, 105, 288

State
 Present, Absent
Binary Attributes
A nominal attribute with only two
categories or states: 0 or 1

0 typically means that the attribute is


absent, and 1 means that it is present

A.k.a.
Boolean Attributes
 if the two states correspond to true and false
Binary Attributes
Symmetric

If both of its states are equally valuable


and carry the same weight
 That is, there is no preference on which outcome
should be coded as 0 or 1

One such example could be the attribute


gender having the states male and female
Binary Attributes
Asymmetric

If both of its states are NOT equally


valuable

Such as the positive and negative outcomes


of a medical test for HIV
Binary Attributes
Asymmetric

By convention, we code the most important outcome
(usually the rarest one, e.g., HIV positive) by 1,
and the other by 0 (e.g., HIV negative)
Ordinal Attributes
An attribute with possible values that have
a meaningful order or ranking among
them

But the magnitude between successive


values is not known

For example, A Pizza 


small, medium, large
Ordinal Attributes
A Pizza 
small, medium, large

Drink Size :
small, medium, and large

Grade :
A+, A, A-, B+, and so on
Ordinal Attributes
Ordinal attributes are often used in surveys
for Ratings

 0: very dissatisfied, 1: somewhat dissatisfied,
   2: neutral, 3: satisfied, and 4: very satisfied.
Qualitative (Aspect of ) Data
Nominal

Binary

Ordinal
Numeric Attributes
A numeric attribute is quantitative
i.e. It is a measurable quantity
Represented in
 integer or real values

Numeric attributes can be


Interval-scaled
Ratio-scaled
Numeric Attributes
Interval-scaled

Measured on a scale of equal-size units

The values can be positive, 0, or negative


Numeric Attributes
Interval-scaled

Not only allow to compare, but also to


quantify the difference between values

However, we can not speak of a value as


being a multiple (or ratio) of another value

 For example, we cannot say that 10 °C is twice as
   warm as 5 °C
Numeric Attributes
Ratio-scaled

Inherent zero-point

We can speak of a value as being a multiple


(or ratio) of another value

These too, allow to compare, as well as


quantify the difference between values
Class Activity
Online Quiz

https://quizzma.com/levels-of-
measurement-quiz/
Discrete vs. Continuous Attributes
 Discrete Attribute
Has only a finite or countably infinite set of values
 E.g., zip codes, profession, or the set of words in a
collection of documents
Sometimes, represented as integer variables
Note: Binary attributes are a special case of
discrete attributes
 Continuous Attribute
Has real numbers as attribute values
 E.g., temperature, height, or weight
Practically, real values can only be measured and
represented using a finite number of digits
Continuous attributes are typically represented as
floating-point variables
Data set
A data set is a collection of numbers or
values that relate to a particular subject or
task.
The test scores of each student in a
particular class
The transactions at a store

Data set for a student’s result prediction?


Student’s SGPA Data set
 Example Data set for a student’s SGPA prediction
 Attributes
SemesterNumber
Semester
Year
Number of courses
Number of labs
Credit hours
Number of PhD teachers
Number of Non-PhD teachers
Number of male teachers
Number of female teachers
SGPA
Standard Public Data sets
Standard Public Data sets available online
Kaggle
UCI Machine Learning Repository
Open Data Pakistan
Google Trends
…
Data, Data; Where do you
lie?
Data, Data; Where do You
Lie?

I Lie in the Center

I Lie in the Dispersion


Central Tendency of Data
Mean

Median

Mode
Mean
Let x1,x2, …,xN be a set of N values or
observations, such as for some numeric
attribute X, like salary

The mean of this set of values is
  x̄ = (x1 + x2 + … + xN) / N
Weighted Arithmetic Mean
Sometimes, each value xi in a set may be
associated with a weight wi
The weights reflect the significance, importance, or
occurrence frequency attached to their respective values
The weighted mean is
  x̄ = (w1·x1 + w2·x2 + … + wN·xN) / (w1 + w2 + … + wN)
Weighted Mean Example
 Quiz Average?
 Quiz marks (out of 10):
   Q1: 2, Q2: 4, Q3: 3, Q4: 7
Weighted Mean Example
 Quiz Average?
 Quiz marks (out of 10) with weights:
   Q1: 2 (weight 1), Q2: 4 (weight 1),
   Q3: 3 (weight 1), Q4: 7 (weight 2)
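The quiz example above can be checked with a small Python sketch comparing the plain and weighted means:

```python
# Plain vs. weighted mean for the quiz example: Q4 counts double.
marks   = [2, 4, 3, 7]      # quiz marks out of 10
weights = [1, 1, 1, 2]      # weight attached to each quiz

plain_mean    = sum(marks) / len(marks)                # (2+4+3+7)/4 = 4.0
weighted_mean = (sum(w * x for w, x in zip(weights, marks))
                 / sum(weights))                       # (2+4+3+14)/5 = 4.6
```

Doubling the weight of the highest quiz pulls the average up from 4.0 to 4.6.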
Mean
A useful Measure of Central Tendency of
Data

But has some Issues


sensitivity to extreme values
 Even a small number of extreme values can corrupt
the mean
 The mean salary at a company may be
substantially pushed up by that of a few highly
paid managers
Trimmed Mean
Useful in avoiding Sensitivity to extreme
values

Mean obtained after chopping off values at


the high and low extremes
For example, sort the values observed for
salary and remove the top and bottom 2%
before computing the mean
 Avoid trimming too large a portion (such as 20%) at
both ends
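A minimal Python sketch of a trimmed mean, assuming we trim a fixed fraction of the sorted values at each end (with only 12 values, a 10% trim drops one value per end; the data is the Example 2.6 list used later in these slides):

```python
# Trimmed mean: drop a fraction of the sorted values at each
# extreme before averaging, to blunt the effect of outliers.
def trimmed_mean(values, trim_fraction):
    """Mean after chopping off trim_fraction of values at each end."""
    n = len(values)
    k = int(n * trim_fraction)          # number of values dropped per end
    ordered = sorted(values)
    kept = ordered[k:n - k] if k else ordered
    return sum(kept) / len(kept)

data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
full = trimmed_mean(data, 0.0)          # plain mean = 58.0
trim = trimmed_mean(data, 0.10)         # drops 30 and 110 -> 55.6
```

Trimming the single low and high extremes moves the mean from 58.0 to 55.6, mostly because the outlier 110 is removed.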
Mean
A useful Measure of Central Tendency of
Data when data is Symmetric

Not a good measure for Asymmetric


(Skewed) Data
Median
A better measure of the center of data,
for skewed (asymmetric) data

It is the middle value in a set of ordered


(sorted) data values
i.e. it is the value that separates the higher
half of a data set from the lower half.

We may extend the concept to ordinal


data
Example 2.6 Data: 30, 36, 47, 50, 52, 52, 56, 60,
63, 70, 70, 110
Measuring the Central Tendency…
 Median:
   Middle value if odd number of values, or average of the
   middle two values otherwise
 For grouped data, the median can be estimated by
   interpolation:
   median ≈ L1 + ( (N/2 − Σ freq_below) / freq_median ) × width
   where L1 is the lower boundary of the median interval
Mode
Value that occurs most frequently

Can be determined for qualitative and


quantitative attributes

Unimodal, Bimodal, and Trimodal data


Mode
Unimodal Data
One value occurs most frequently
Bimodal Data
Two values occur most frequently
 (with same frequency)

Trimodal Data
Three values occur most frequently
 (with same frequency)
Example 2.6 Data: 30, 36, 47, 50, 52, 52, 56, 60,
63, 70, 70, 110
Mode
 In general, a data set with two or more
modes is multimodal

At the other extreme, if each data value


occurs only once, then there is no mode
Measuring the Central Tendency…
 Mode
 For grouped data, the mode can be estimated from the
   empirical relation: mean − mode ≈ 3 × (mean − median)
Midrange
The midrange can also be used to assess
the central tendency of a numeric data set

It is the average of the largest and smallest


values in the set
Example 2.6 Data: 30, 36, 47, 50, 52, 52, 56, 60,
63, 70, 70, 110
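The central-tendency measures discussed above can be verified on the Example 2.6 data with Python's standard statistics module:

```python
# Central tendency of the Example 2.6 data.
import statistics

data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

mean     = statistics.mean(data)          # 696 / 12 = 58
median   = statistics.median(data)        # (52 + 56) / 2 = 54
modes    = statistics.multimode(data)     # bimodal: 52 and 70
midrange = (min(data) + max(data)) / 2    # (30 + 110) / 2 = 70
```

Note how the outlier 110 pulls the mean (58) above the median (54), and makes the midrange (70) the least robust of the three.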
Dispersion of Data

Range, Quartiles, and Interquartile Range

Variance and Standard Deviation


Why Dispersion of Data?
Are the following three datasets different?
1, 3, 5 (Mean = 3)
2, 3, 4 (Mean = 3)
3, 3, 3 (Mean = 3)
Why Dispersion of Data?
1, 3, 5 (Mean = 3),
Dispersion/spread/Range? 4
2, 3, 4 (Mean = 3)
2
3, 3, 3 (Mean = 3)
0

Dispersion
Range, Quartiles, and Interquartile Range
Variance and Standard Deviation
Choose the mobile, M1 or
M2?
M1
Average of Customer ratings: 3

M2
Average of Customer ratings: 3

Which mobile will you prefer to buy: M1 or M2?
Choose the mobile, M1 or
M2?
M1
Customer ratings: 1, 1, 1, 1, 5, 5, 5, 5

M2
Customer ratings: 2, 3, 3, 3, 3, 3, 3, 4

Which mobile will you prefer to buy: M1 or M2?
Choose the mobile, M1 or
M2?
M1
Customer ratings: 1, 1, 1, 1, 5, 5, 5, 5
Mean = 3, Dispersion = 5-1 = 4
M2
Customer ratings: 2, 3, 3, 3, 3, 3, 3, 4
Mean = 3, Dispersion = 4 – 2 = 2
Which mobile will you prefer to buy: M1 or M2?
Which one is consistent? Which one is risky?
Range, Variance
The range of a data set is the difference
between the largest and smallest values
Example 2.6 Data: 30, 36, 47, 50, 52, 52, 56, 60,
63, 70, 70, 110

Range: 110-30 = 80

Variance and standard deviation
 Variance (algebraic, scalable computation):
   σ² = (1/N) Σᵢ₌₁ᴺ (xᵢ − μ)² = (1/N) Σᵢ₌₁ᴺ xᵢ² − μ²
 Standard deviation s (or σ) is the square root of variance s² (or σ²)
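A Python sketch verifying the two algebraically equivalent variance forms above on the Example 2.6 data:

```python
# Population variance and standard deviation of the Example 2.6 data,
# computed both by definition and by the scalable "sum of squares" form.
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
mu = sum(data) / len(data)                                  # 58.0

var_def = sum((x - mu) ** 2 for x in data) / len(data)      # (1/N) sum (x - mu)^2
var_alg = sum(x * x for x in data) / len(data) - mu ** 2    # (1/N) sum x^2 - mu^2
sigma   = var_def ** 0.5                                    # standard deviation
```

Both forms give the same value (about 379.17, so σ ≈ 19.47); the second form is "scalable" because it needs only running sums of x and x², not a second pass over the data.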
Quantiles
Quantiles are points taken at regular
intervals of a data distribution, dividing it
into essentially equal size consecutive sets
Quantiles
The kth q-quantile for a given data
distribution is the value x such that at most
k/q of the data values are less than x and
at most (q − k)/q of the data values are
more than x
Quartiles
The 4-quantiles, comprising the three
data points that split the data distribution
into four equal parts

Each part represents one-fourth of the data


distribution.
Marks out of 10: 2, 3, 3, 4, 5, 6, 7
 Median = 4 → 40 percent (4 marks out of 10),
   but the 50th percentile (half the values lie below it)
 Top score 7 → 70 percent (7 marks out of 10),
   but the 100th percentile (no value lies above it)
 Percent vs Percentile: percent compares a score to the
   maximum possible; percentile compares it to the rest of
   the data
Deciles
The 10-quantiles, comprising the nine
data points that split the data distribution
into ten equal parts

Each part represents one-tenth of the data


distribution.
Percentiles
The 100-quantiles, comprising the ninety-nine
data points that split the data
distribution into a hundred equal parts

Each part represents one-hundredth of the


data distribution.
Interquartile Range (IQR)
The distance between the first and third
quartiles
It is a simple measure of spread that gives
the range covered by the middle half of the
data
Example 2.6 Data: 30, 36, 47, 50, 52, 52, 56, 60,
63, 70, 70, 110
 Q1 = value at position 12/4 = 3rd value = 47
 Q2 = value at position 6 = 52
 Q3 = value at position 9 = 63
 IQR = Q3 − Q1 = 63 − 47 = 16
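A sketch of the quartile convention used above (Qk taken as the (k·N/4)-th ordered value, which is clean when N is divisible by 4, as here with N = 12). Note this is one of several conventions; library routines such as numpy.percentile interpolate by default and can give slightly different answers.

```python
# Quartiles and IQR using the slide's convention:
# Qk is the (k * N / 4)-th ordered value (1-based rank).
def quartile(values, k):
    ordered = sorted(values)
    position = k * len(ordered) // 4    # 1-based rank of Qk
    return ordered[position - 1]

data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
q1, q2, q3 = (quartile(data, k) for k in (1, 2, 3))   # 47, 52, 63
iqr = q3 - q1                                          # 16
```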
Five-number Summary
Minimum Value

Q1

Median

Q3

Maximum Value
Example: Statistical Description of
Data – 5 Number Summary
Data: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70,
70, 110
 Q1 = 47
 Q2 = 52
 Q3 = 63
 IQR = 16
 Min = ?
 Max = ?
 Normal data lies within the outlier fences:
   Q1 − (1.5)(IQR)  to  Q3 + (1.5)(IQR)
   47 − (1.5)(16)   to  63 + (1.5)(16)
   47 − 24          to  63 + 24
   23               to  87
   110 lies outside the fences, so it is an outlier
 Five-Number Summary
   (min, Q1, Q2, Q3, max) = (30, 47, 52, 63, 70)
   (computed after excluding the outlier 110)
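The fences and five-number summary can be reproduced in Python (the quartile values are taken from the worked example above):

```python
# Five-number summary with 1.5*IQR outlier fences.
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
q1, q2, q3 = 47, 52, 63                  # quartiles from the previous slide
iqr = q3 - q1                            # 16

lo_fence = q1 - 1.5 * iqr                # 23.0
hi_fence = q3 + 1.5 * iqr                # 87.0
outliers = [x for x in data if not lo_fence <= x <= hi_fence]   # [110]
inliers  = [x for x in data if lo_fence <= x <= hi_fence]

summary = (min(inliers), q1, q2, q3, max(inliers))   # (30, 47, 52, 63, 70)
```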
Measuring the Dispersion of Data

 Quartiles, outliers and boxplots


 Quartiles: Q1 (25th percentile), Q3 (75th percentile)

 Inter-quartile range: IQR = Q3 – Q1

 Five number summary: min, Q1, median, Q3, max

 Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and

plot outliers individually


 Outlier: usually, a value higher/lower than 1.5 x IQR from Q3/Q1

 Variance and standard deviation
   Variance (algebraic, scalable computation):
     σ² = (1/N) Σᵢ₌₁ᴺ (xᵢ − μ)² = (1/N) Σᵢ₌₁ᴺ xᵢ² − μ²
   Standard deviation s (or σ) is the square root of variance s² (or σ²)
Graphic Displays of Basic Statistical
Descriptions

 Boxplot: graphic display of five-number summary


 Histogram: x-axis shows values, y-axis represents frequencies
 Quantile plot: each value xᵢ is paired with fᵢ, indicating
   that approximately 100·fᵢ% of data are ≤ xᵢ
 Quantile-quantile (q-q) plot: graphs the quantiles of one
   univariate distribution against the corresponding quantiles
   of another
 Scatter plot: each pair of values is a pair of coordinates
   and plotted as points in the plane
Boxplot Analysis

 Five-number summary of a distribution


Minimum, Q1, Median, Q3, Maximum

 Boxplot
Data is represented with a box
The ends of the box are at the first and third
quartiles, i.e., the height of the box is IQR
The median is marked by a line within the box
Whiskers: two lines outside the box extended to
Minimum and Maximum
Outliers: points beyond a specified outlier
threshold, plotted individually

Boxplot in Matlab
>> d = [30, 36, 47, 50, 52, 52, 56, 60, 63,
70, 70, 110];
>> boxplot(d);
Histogram Analysis
 Histogram: graph display of tabulated frequencies, shown as bars
 It shows what proportion of cases fall into each of several
   categories
 The categories are usually specified as non-overlapping
   intervals of some variable. The categories (bars) must be
   adjacent
 (Figure: a histogram over the bins 10000, 30000, 50000, 70000,
   90000, with frequencies up to 40)
Histograms Often Tell More than Boxplots

 The two histograms shown on the left may have the same
   boxplot representation
   The same values for: min, Q1, median, Q3, max
 But they have rather different data distributions
   1, 1, 1, 3, 5, 5, 5
   1, 3, 3, 3, 3, 3, 5
Properties of Normal Distribution Curve

 The normal (distribution) curve
   From μ−σ to μ+σ: contains about 68% of the measurements
     (μ: mean, σ: standard deviation)
   From μ−2σ to μ+2σ: contains about 95% of it
   From μ−3σ to μ+3σ: contains about 99.7% of it
Quantile Plot
 Displays all of the data (allowing the user to assess
both the overall behavior and unusual occurrences)
 Plots quantile information
   For data xᵢ sorted in increasing order, fᵢ indicates that
   approximately 100·fᵢ% of the data are below or equal to
   the value xᵢ
Quantile-Quantile (Q-Q) Plot
 Graphs the quantiles of one univariate distribution against the
corresponding quantiles of another
 View: Is there a shift in going from one distribution to
   another?
 Example shows unit price of items sold at Branch 1 vs. Branch 2
   for each quantile. Unit prices of items sold at Branch 1 tend to
   be lower than those at Branch 2.
Scatter plot
 Provides a first look at bivariate data to see
clusters of points, outliers, etc
 Each pair of values is treated as a pair of
coordinates and plotted as points in the plane
Example: Ice Cream Sales vs Temperature

Temperature (°C)   Ice Cream Sales
14.2               $215
16.4               $325
11.9               $185
15.2               $332
18.5               $406
22.1               $522
19.4               $412
25.1               $614
23.4               $544
18.1               $421
22.6               $445
17.2               $408
Positively and Negatively Correlated Data

 The left half fragment is

positively correlated
 The right half is negative

11 correlated
1
Uncorrelated Data

Chapter 2: Getting to Know Your
Data
Data Objects and Attribute Types

Basic Statistical Descriptions of Data

Data Visualization

Measuring Data Similarity and Dissimilarity

Summary

Data Visualization
Pixel-Oriented Visualization Techniques
 For a data set of m dimensions, create m windows on the screen, one for
each dimension
 The m dimension values of a record are mapped to m pixels at the
corresponding positions in the windows
 The colors of the pixels reflect the corresponding values

(Figure: pixel windows for (a) Income, (b) Credit Limit,
(c) Transaction volume, (d) Age)
Geometric Projection Visualization
Techniques
 Visualization of geometric transformations and
projections of the data
 Methods
Scatterplot and scatterplot matrices
Parallel coordinates
Landscapes

IRIS Dataset

UC Irvine Machine Learning Repository

https://archive.ics.uci.edu/ml/datasets/iris
http://support.sas.com/documentation/
Icon-Based Visualization
Techniques
 Visualization of the data values as features of icons
 Typical visualization methods
Chernoff Faces
Stick Figures
 General techniques
Shape coding: Use shape to represent certain
information encoding
Color icons: Use color icons to encode more
information

Chernoff Faces
 A way to display variables on a two-dimensional surface, e.g., let x be
eyebrow slant, y be eye size, z be nose length, etc.
 The figure shows faces produced using 10 characteristics (head
   eccentricity, eye size, eye spacing, eye eccentricity, pupil size,
   eyebrow slant, nose size, mouth shape, mouth size, and mouth
   opening), each assigned one of 10 possible values, generated
   using Mathematica (S. Dickson)

 REFERENCE: Gonick, L. and Smith, W.


The Cartoon Guide to Statistics. New York:
Harper Perennial, p. 212, 1993
 Weisstein, Eric W. "Chernoff Face." From
MathWorld--A Wolfram Web Resource.
mathworld.wolfram.com/ChernoffFace.html

Hierarchical Visualization Techniques
Visualization of the data using a
hierarchical partitioning into subspaces
Methods
Worlds-within-Worlds
Tree-Map

Worlds-within-Worlds

(Figure: an outer plot spans dimensions x1 and x2; at a selected
point in it, e.g. (2, 3), an inner "world" plots the remaining
dimensions x3 and x4)
Tree-Map
 Screen-filling method which uses a hierarchical partitioning of
the screen into regions depending on the attribute values
 The x- and y-dimension of the screen are partitioned
alternately according to the attribute values (classes)

https://support.office.com/
Visualizing Complex Data and
Relations
 Visualizing non-numerical data: text and social networks
 Tag cloud: visualizing user-generated tags (Try https://www.wordclouds.com/)

 The importance of tag is


represented by font size/color
 Besides text data, there are also
methods to visualize relationships, such
as visualizing social networks

Newsmap: Google News Stories in 2005


Activity
Explore the following for data visualization
RAWgraphs 2.0 (https://app.rawgraphs.io/)
https://www.wordclouds.com/
Google Charts (
https://developers.google.com/chart)
…
Chapter 2: Getting to Know Your
Data
Data Objects and Attribute Types

Basic Statistical Descriptions of Data

Data Visualization

Measuring Data Similarity and Dissimilarity

Summary

Data Similarity and
Dissimilarity
In Data Mining, we often need ways to assess
how alike or unalike objects are in comparison to
one another
Data Similarity and Dissimilarity
Dis(similarity) computation is important for
many Data Mining Tasks, e.g.,
Clustering: “... objects within a cluster are
similar to one another and dissimilar to the
objects in other clusters”
Outlier Analysis: “Outliers as objects that
are highly dissimilar to others”
Example: A(2, 4), B(4, 7)
 D(A,B) = sqrt( (2−4)² + (4−7)² ) = sqrt(13) ≈ 3.61
 D(B,A) = D(A,B) ≈ 3.61 (distance is symmetric)
Dissimilarity Computation
Computation of Dissimilarity depends upon
types of attributes
Attributes can be Nominal, Binary, Ordinal
or Numeric
Student   City      Age   Semester   Hostel
Ali       Lahore    20    First      No
Bilal     Karachi   24    Sixth      Yes
Javed     Multan    25    Fifth      No
Aslam     Lahore    23    Seventh    No

Can you suggest two teams of students from the above data, based on similarity?
Measures of Proximity
Similarity Measures
How much alike objects are in comparison to
one another

Dissimilarity Measures
How much unalike objects are in comparison
to one another
Data Matrix
Object-by-Attribute Structure
A.k.a. Two-Mode Matrix
Rows
Objects

Columns
Attributes

NxP
N is the number of objects
P is the number of attributes
Dissimilarity Matrix
Object-by-Object Structure
Value at (i, j) position represents the
measured dissimilarity or “difference”
between objects i and j
d(1,2)

N objects
Size?
NxN

One kind of Entity


dissimilarities

Hence called One-Mode Matrix


Proximity Measures for Nominal
Attributes

Dissimilarity based on the ratio of mismatches:
  d(i, j) = (p − m) / p
m: number of matches (i.e., the number of attributes for
which objects i and j are in the same state)
p: total number of attributes describing the objects
Proximity Measures for Nominal
Attributes

Dissimilarity based on the ratio of


Mismatches
Student City
Ali Lahore
Bilal Karachi
Javed Multan
Aslam Lahore
p = ?
Proximity Measures for Nominal
Attributes

d(2,1) = ?
d(3,1) = ?
d(3,2) = ?
d(4,1) = ?
d(4,2) = ? Student City
d(4,3) = ? Ali Lahore
Bilal Karachi
Javed Multan
Hint: d(i, j) = (p − m) / p
Aslam Lahore
Proximity Measures for Nominal
Attributes

Object ID P1 P2 P3 P4
1 A X C L
2 B Y G W
3 A Y K A
4 A Y C A
p = ?
Proximity Measures for Nominal
Attributes

Object ID P1 P2 P3 P4
1 A X C L
2 B Y G W
3 A Y K A
4 A Y C A
p = 4
Proximity Measures for Nominal
Attributes

Object ID P1 P2 P3 P4
1 A X C L
2 B Y G W
3 A Y K A
4 A Y C A

 d(2,1) = ? =(p-m)/p = (4-0)/4 = 1


 d(3,1) = ? = (4-1)/4 = ¾ = 0.75
 d(3,2) = ?
 d(4,1) = ?
 d(4,2) = ?
 d(4,3) = ?
Dissimilarity = D(4,3) = ¼ = 0.25
Dissimilarity = ¾ = 0.75
Proximity Measures for Nominal
Attributes

Object ID P1 P2 P3 P4
1 A X C L
2 B Y G W
3 A Y K A
4 A Y C A

 d(2,1) = 1
 d(3,1) = 0.75
 d(3,2) = 0.75
 d(4,1) = 0.5
 d(4,2) = 0.75
 d(4,3) = 0.25
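The mismatch-ratio dissimilarities above can be reproduced with a few lines of Python:

```python
# Mismatch-ratio dissimilarity d(i, j) = (p - m) / p for the
# four objects from the table above.
def nominal_dissim(a, b):
    p = len(a)                               # total attributes
    m = sum(x == y for x, y in zip(a, b))    # number of matches
    return (p - m) / p

objects = {
    1: ("A", "X", "C", "L"),
    2: ("B", "Y", "G", "W"),
    3: ("A", "Y", "K", "A"),
    4: ("A", "Y", "C", "A"),
}
d21 = nominal_dissim(objects[2], objects[1])   # 1.0  (no matches)
d31 = nominal_dissim(objects[3], objects[1])   # 0.75 (one match)
d41 = nominal_dissim(objects[4], objects[1])   # 0.5  (two matches)
d43 = nominal_dissim(objects[4], objects[3])   # 0.25 (three matches)
```

Similarity follows immediately as sim(i, j) = 1 − d(i, j), matching the next slide.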
Proximity Measures for Nominal
Attributes

Object ID P1 P2 P3 P4
1 A X C L
2 B Y G W
3 A Y K A
4 A Y C A

 d(2,1) = 1  sim(2,1) = 0
 d(3,1) = 0.75  sim(3,1) = 0.25
 d(3,2) = 0.75  sim(3,2) = 0.25
 d(4,1) = 0.5  sim(4,1) = 0.5
 d(4,2) = 0.75  sim(4,2) = 0.25
 d(4,3) = 0.25  sim(4,3) = 0.75
Proximity Measure for Nominal Attributes

Can take 2 or more states, e.g., red, yellow, blue, green


Method 1: Simple matching
  d(i, j) = (p − m) / p
  m: # of matches, p: total # of variables

Method 2: Use a large number of binary attributes


creating a new binary attribute for each of the M nominal
states
HairColor: Grey, Black, White (nominal attribute with three
states)
 3 states so three binary variables
 isGrey, isBlack, isWhite

Activity
Person   Hair_color   Marital_status   Profession
Ali      Black        Single           Engineer
Bilal    Brown        Single           Engineer
Javed    Black        Married          Banker
Compute dissimilarity matrix


Proximity Measures for Binary
Attributes

p r
number of attributes
total number of
that equal 1 for object
attributes i but equal 0 for object
j
q
s
number of number of attributes
attributes that that equal 0 for object
equal 1 for both i but equal 1 for object
j
objects i and j t
number of attributes
that equal 0 for both
objects i and j
Proximity Measures for Binary
Attributes

Symmetric binary attributes

Asymmetric binary attributes

A B C D E
Ali 1 0 0 1 0
Ahmed 1 1 0 0 1
Asymmetric Binary Similarity
A.k.a. Jaccard Coefficient
Asymmetric Binary Dissimilarity

 Name is an object identifier, and only Gender is a symmetric
   binary attribute; the remaining attributes are asymmetric binary
 (Figure: a contingency table of Jim vs. Mary giving the counts
   q, r, s, and t over the asymmetric attributes)
 Dissimilarity based only on the asymmetric attributes:
   d(i, j) = (r + s) / (q + r + s)
 The Jaccard coefficient (asymmetric binary similarity):
   sim(i, j) = q / (q + r + s) = 1 − d(i, j)
Dissimilarity of Numeric Data

Euclidean Distance

Manhattan Distance

Supremum Distance

Minkowski Distance
Euclidean Distance
Straight Line Distance

“as the crow flies” Distance


Euclidean Distance
Given two objects i & j described by p
numeric attributes
i = (xi1, xi2, … , xip)
j = (xj1, xj2, … , xjp)

 i = (2, 4, 8)
 j = (4, 3, 4)
 d(i, j) = sqrt(2² + 1² + 4²) = sqrt(21) ≈ 4.58
Manhattan Distance
Given two objects i & j described by p
numeric attributes
i = (xi1, xi2, … , xip)
j = (xj1, xj2, … , xjp)

(2, 4, 8)
(4, 3, 4)
Manhattan D = |2-4| + |4-3| + |8-
4| = 7
Manhattan Distance
City Block Distance

https://www.google.com/maps/place/Manhattan,
+New+York,+NY,+USA/
Supremum Distance
Euclidean, Manhattan, and
Supremum distances
Supremum distance
= max ( |1-3|, |2-5|)
= max ( 2, 3)
=3
Minkowski Distance
 A generalization of the Euclidean and Manhattan distances:
   d(i, j) = ( Σₖ |x_ik − x_jk|^h )^(1/h)
 Manhattan Distance: h = 1
 Euclidean Distance: h = 2
Minkowski Distance
A generalization of the Euclidean and
Manhattan distances

Where, h is a real number and h ≥ 1

Also called Lp norm in literature


But p here represents h in the above formula
NOT the number of attributes
Minkowski Distance
A generalization of the Euclidean and
Manhattan distances

Also called Lp norm in literature


But p represents h in the above formula (and
NOT the number of attributes)

Hence, Manhattan distance is L1 norm and


Euclidean distance L2 norm
Supremum Distance
Also referred to as
Lmax Norm
L∞ Norm or Uniform Norm
Chebyshev Distance
Example: Minkowski Distance
Dissimilarity Matrices

Points: x1 = (1, 2), x2 = (3, 5), x3 = (2, 0), x4 = (4, 5)

Manhattan (L1)
L1    x1    x2    x3    x4
x1    0
x2    5     0
x3    3     6     0
x4    6     1     7     0

Euclidean (L2)
L2    x1    x2    x3    x4
x1    0
x2    3.61  0
x3    2.24  5.1   0
x4    4.24  1     5.39  0

Supremum (L∞)
L∞    x1    x2    x3    x4
x1    0
x2    3     0
x3    2     5     0
x4    3     1     5     0
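These matrices can be reproduced with a small Python sketch of the Minkowski distance; the supremum distance is the limit as h grows without bound:

```python
# Minkowski distance for h >= 1; Manhattan is h = 1, Euclidean is h = 2,
# and the supremum (Chebyshev) distance is the h -> infinity limit.
def minkowski(a, b, h):
    return sum(abs(x - y) ** h for x, y in zip(a, b)) ** (1 / h)

def supremum(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

x1, x2 = (1, 2), (3, 5)

manhattan = minkowski(x1, x2, 1)   # |1-3| + |2-5| = 5.0
euclidean = minkowski(x1, x2, 2)   # sqrt(4 + 9) = sqrt(13), about 3.61
sup       = supremum(x1, x2)       # max(2, 3) = 3
```

These match the x1/x2 entries of the three matrices above; looping over all point pairs fills in the rest.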
Weighted Distances
 E.g., Weighted Euclidean Distance:
   d(i, j) = sqrt( w1·(x_i1 − x_j1)² + w2·(x_i2 − x_j2)² + … + wp·(x_ip − x_jp)² )
Homework and
Announcement
Read Chapter 2 of the text book
Do activities, draw plots
Get ready for a Quiz in next class
