Datalec 1

Chapter 2 of 'Data Mining: Concepts and Techniques' discusses the fundamental aspects of data, including data objects, attribute types, and basic statistical descriptions. It covers various types of data sets, important characteristics of structured data, and methods for measuring central tendency such as mean, median, mode, and midrange. The chapter emphasizes understanding data through visualization and similarity measurements.

Uploaded by

agents0209

Data Mining:

Concepts and Techniques

— Chapter 2 —

Jiawei Han, Micheline Kamber, and Jian Pei

University of Illinois at Urbana-Champaign
Simon Fraser University
©2013 Han, Kamber, and Pei. All rights reserved.
Chapter 2: Getting to Know Your Data

 Data Objects and Attribute Types

 Basic Statistical Descriptions of Data

 Data Visualization

 Measuring Data Similarity and Dissimilarity

 Summary

Types of Data Sets
 Record
 Relational records
 Data matrix, e.g., numerical matrix, crosstabs
 Document data: text documents represented as term-frequency vectors

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1    3     0     5     0      2     6     0    2       0       2
Document 2    0     7     0     2      1     0     0    3       0       0
Document 3    0     1     0     0      1     2     2    0       3       0

 Transaction data

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

 Graph and network
 World Wide Web
 Social or information networks
 Molecular structures
 Ordered
 Video data: sequence of images
 Temporal data: time-series
 Sequential data: transaction sequences
 Genetic sequence data
 Spatial, image and multimedia
 Spatial data: maps
 Image data
 Video data
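A term-frequency vector of the kind shown for document data can be sketched in a few lines of Python. This is only an illustration: the function name `term_frequency_vector`, the five-term vocabulary, and the sample sentence are assumptions made for the example, and tokenization is reduced to lowercasing and splitting on whitespace.

```python
from collections import Counter


def term_frequency_vector(text, vocabulary):
    """Return how often each vocabulary term occurs in the text."""
    # Naive tokenization (an assumption): lowercase, split on whitespace.
    counts = Counter(text.lower().split())
    # Counter returns 0 for terms that never occur in the text.
    return [counts[term] for term in vocabulary]


# Hypothetical vocabulary and document, chosen only for illustration.
vocabulary = ["team", "coach", "play", "ball", "score"]
document = "team play ball play team score team"

vector = term_frequency_vector(document, vocabulary)
print(vector)  # [3, 0, 2, 1, 1]
```

Each position in the vector corresponds to one term of the vocabulary, so documents of different lengths become rows of the same width.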
Important Characteristics of Structured Data

 Dimensionality
 Curse of dimensionality
 Sparsity
 Only presence counts
 Resolution
 Patterns depend on the scale
 Distribution
 Centrality and dispersion

Data Objects

 Data sets are made up of data objects.

 A data object represents an entity.
 Examples:
 sales database: customers, store items, sales
 medical database: patients, treatments
 university database: students, professors, courses
 Also called samples, examples, instances, data points, objects, or tuples.
 Data objects are described by attributes.
 Database rows -> data objects; columns -> attributes.
Attributes
 Attribute (or dimension, feature, variable): a data field representing a characteristic or feature of a data object.
 E.g., customer_ID, name, address

 Types:
 Nominal

 Binary

 Ordinal

 Numeric: quantitative

 Interval-scaled

 Ratio-scaled
Attribute Types
 Nominal:
 Nominal means “relating to names.”
 The values of a nominal attribute are symbols or “names of things”.
 Each value represents some kind of category, code, or state.
 So nominal attributes are also referred to as categorical.
 The values do not have any meaningful order.
 Hair_color = { black, brown, grey, red, white}
 Occupation = {teacher, dentist, programmer, farmer }
 It is possible to represent such symbols or names with numbers.
 With hair color, we can assign a code of 0 for black, 1 for brown, and so on.
 Another example is customer ID, with possible values that are all numeric.
 In such cases, the numbers are not intended to be used quantitatively.
 Mathematical operations on values of nominal attributes are not meaningful.
 Even though a nominal attribute may have integers as values, it is not considered a numeric attribute because the integers are not meant to be used quantitatively.
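The point about integer codes can be illustrated with a small Python sketch. The list of observed hair colors is made up for the example, and the code assignment (0 for black, 1 for brown, and so on) simply follows list order.

```python
from collections import Counter

# Assign integer codes to the nominal values in list order.
hair_colors = ["black", "brown", "grey", "red", "white"]
code_of = {color: code for code, color in enumerate(hair_colors)}

# Hypothetical observations, for illustration only.
observed = ["brown", "black", "brown", "red"]
codes = [code_of[color] for color in observed]
print(codes)  # [1, 0, 1, 3]

# Counting occurrences is meaningful for nominal data.
most_common_code = Counter(codes).most_common(1)[0][0]
print(hair_colors[most_common_code])  # brown

# Arithmetic on the codes runs without error but means nothing:
# 1.25 is not a hair color, and the result depends on the arbitrary coding.
average_code = sum(codes) / len(codes)
print(average_code)  # 1.25
```

Reordering `hair_colors` would change `average_code` but not the most frequent color, which is why only equality and counting are safe for nominal attributes.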
Attribute Types
 Binary
 Nominal attribute with only 2 states (0 and 1)
 Binary attributes are referred to as Boolean if the two states
correspond to true and false.
 Symmetric binary:
 Its two states are equally valuable and carry the same weight.
 There is no preference on which outcome should be coded as 0 or 1.
 E.g., gender
 Asymmetric binary:
 The outcomes of the states are not equally important.
 E.g., medical test (positive vs. negative)
 Convention: code the most important outcome, which is usually the rarest one, by 1 (e.g., HIV positive) and the other by 0.

Attribute Types
 Ordinal
 An attribute with possible values that have a meaningful order or ranking among them, but the magnitude between successive values is not known.
 Size = {small, medium, large}
 Grade = {A+, A, A-, B+, and so on}
 Ordinal attributes are useful for registering subjective assessments of qualities that cannot be measured objectively.
 Ordinal attributes are often used in surveys for ratings.
 Nominal, binary, and ordinal attributes are qualitative.
 They describe a feature of an object without giving an actual size or quantity.
 The values of such qualitative attributes are typically words representing categories.

Numeric Attribute Types
 A numeric attribute is quantitative.
 It is a measurable quantity, represented in integer or real values.
 Numeric attributes can be interval-scaled or ratio-scaled.
 Interval-scaled
 Measured on a scale of equal-sized units.
 The values have order and can be positive, 0, or negative.
 In addition to providing a ranking of values, such attributes allow us to compare and quantify the difference between values.
 E.g., the outdoor temperature recorded on a number of different days.
 By ordering the values, we obtain a ranking of the objects with respect to temperature.
 We can quantify the difference between values.
 For example, a temperature of 20°C is five degrees higher than a temperature of 15°C.

Numeric Attribute Types
 Calendar dates are another example. For instance, the years 2002 and 2010 are eight years apart.
 Temperatures in Celsius and Fahrenheit do not have a true zero-point, that is, neither 0°C nor 0°F indicates “no temperature.”
 Ratio-scaled
 Inherent zero-point
 We can speak of a value as being a multiple (or ratio) of another value (e.g., 10 K is twice as high as 5 K).
 E.g., temperature in Kelvin, length, counts, monetary quantities
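The difference between interval-scaled and ratio-scaled values can be checked numerically. The sketch below is illustrative only: the helper `celsius_to_kelvin` and the two sample temperatures are assumptions chosen for the example.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius reading to Kelvin, which has a true zero-point."""
    return celsius + 273.15


# Two hypothetical outdoor temperatures in degrees Celsius.
warm_c, cool_c = 10.0, 5.0

# Differences are meaningful on an interval scale and survive the conversion.
difference_c = warm_c - cool_c
difference_k = celsius_to_kelvin(warm_c) - celsius_to_kelvin(cool_c)

# The Celsius ratio suggests "twice as warm", but that is an artifact of
# where the Celsius zero happens to sit.
naive_ratio = warm_c / cool_c

# On the Kelvin scale the ratio is meaningful, and it is close to 1.
kelvin_ratio = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)

print(difference_c, round(difference_k, 6))  # 5.0 5.0
print(naive_ratio, round(kelvin_ratio, 3))   # 2.0 1.018
```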

Discrete vs. Continuous Attributes
 Classification algorithms often treat attributes as being either discrete or continuous.
 Discrete Attribute
 Has only a finite or countably infinite set of values
 E.g., zip codes, profession, or the set of words in a collection of documents
 Sometimes represented as integer variables
 Note: Binary attributes are a special case of discrete attributes
 Continuous Attribute
 Has real numbers as attribute values
 E.g., temperature, height, or weight
 Practically, real values can only be measured and represented using a finite number of digits
 Continuous attributes are typically represented as floating-point variables
Basic Statistical Descriptions of Data
 Motivation
 To better understand the data: central tendency, variation
and spread
 Data dispersion characteristics
 median, max, min, quantiles, outliers, variance, etc.
 Numerical dimensions correspond to sorted intervals
 Data dispersion: analyzed with multiple granularities of
precision
 Boxplot or quantile analysis on sorted intervals
 Dispersion analysis on computed measures
 Folding measures into numerical dimensions
 Boxplot or quantile analysis on the transformed cube
Measuring the Central Tendency
 Various ways to measure the central tendency of data.
 We have some attribute X, like salary, which has been
recorded for a set of objects.
 Let \(x_1, x_2, \ldots, x_N\) be the set of N observed values or observations for X.
 These values may also be referred to as the data set.
 If we were to plot the observations for salary, where would
most of the values fall?
 This gives us an idea of the central tendency of the data.
 Measures of central tendency include the mean, median,
mode, and midrange.

MEAN
 The most common and effective numeric measure of
the “center” of a set of data is the (arithmetic) mean.
 Let \(x_1, x_2, \ldots, x_N\) be a set of N values or observations, such as for some numeric attribute X, like salary.
 The mean of this set of values is

\[ \bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i = \frac{x_1 + x_2 + \cdots + x_N}{N} \]
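As a minimal sketch, the arithmetic mean can be computed directly from its definition. The function name `mean` is chosen for the example, and the data are the twelve salary values (in thousands of dollars) that these slides use in later examples.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by their count N."""
    if not values:
        raise ValueError("mean of an empty data set is undefined")
    return sum(values) / len(values)


# Salary values in thousands of dollars, as used elsewhere in these slides.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(mean(salaries))  # 58.0
```

The twelve values sum to 696, so the mean salary is 696 / 12 = 58, that is, $58,000.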

MEAN
 Sometimes, each value \(x_i\) in a set may be associated with a weight \(w_i\) for \(i = 1, \ldots, N\).
 The weights reflect the significance, importance, or occurrence frequency attached to their respective values.
 In this case, we can compute

\[ \bar{x} = \frac{\sum_{i=1}^{N} w_i x_i}{\sum_{i=1}^{N} w_i} = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_N x_N}{w_1 + w_2 + \cdots + w_N} \]

 This is called the weighted arithmetic mean or the weighted average.
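The weighted arithmetic mean can be sketched as follows. The function name `weighted_mean` is an assumption, and the example uses occurrence frequencies as weights: each distinct salary value (in thousands of dollars, taken from the salary example in these slides) is weighted by how many times it occurs.

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum of w_i * x_i divided by the sum of w_i."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, values)) / total_weight


# Distinct salary values and how often each one occurs.
distinct_salaries = [30, 36, 47, 50, 52, 56, 60, 63, 70, 110]
frequencies = [1, 1, 1, 1, 2, 1, 1, 1, 2, 1]

print(weighted_mean(distinct_salaries, frequencies))  # 58.0
```

With frequencies as weights, the result equals the plain mean of the full list of twelve values, which is a quick sanity check on the formula.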
MEAN
 A major problem with the mean is its sensitivity to extreme (e.g., outlier) values.
 Even a small number of extreme values can corrupt the mean.
 For example, the mean salary at a company may be substantially pushed up by that of a few highly paid managers.
 Similarly, the mean score of a class in an exam could be pulled down quite a bit by a few very low scores.
 To offset the effect caused by a small number of extreme values, we can instead use the trimmed mean, which is the mean obtained after chopping off values at the high and low extremes.
 For example, we can sort the values observed for salary and remove the top and bottom 2% before computing the mean.
 We should avoid trimming too large a portion (such as 20%) at both ends, as this can result in the loss of valuable information.
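A trimmed mean can be sketched as below. The function name `trimmed_mean` and the choice to round the number of trimmed values down to a whole number are assumptions; other conventions exist. The data are the salary values (in thousands of dollars) from the salary example in these slides.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping the given proportion of values from each end."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    if not values:
        raise ValueError("trimmed mean of an empty data set is undefined")
    ordered = sorted(values)
    # Number of values removed at each end, rounded down (an assumption).
    k = int(len(ordered) * proportion)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Trimming 10% from each end removes one value per side: 30 and 110.
print(trimmed_mean(salaries, 0.10))  # 55.6

# With only 12 values, 2% per side rounds down to zero values removed.
print(trimmed_mean(salaries, 0.02))  # 58.0
```

Dropping the single high value of 110 and the low value of 30 moves the mean from 58.0 to 55.6, which shows how much a few extreme values can contribute.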

MEDIAN
 For skewed (asymmetric) data, a better measure of the center is the median, the middle value in a set of ordered data values.
 Suppose we have the following values for salary (in thousands of dollars), shown in increasing order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
 The data are already sorted in increasing order.
 There is an even number of observations (i.e., 12), so the median is not unique.
 It can be any value between the two middlemost values of 52 and 56.
 By convention, we assign the average of the two middlemost values as the median; that is, (52 + 56) / 2 = 54.
 The median is $54,000.
 Suppose that we had only the first 11 values in the list. Given an odd number of values, the median is the middlemost value. This is the sixth value in this list, which has a value of $52,000.
 The median is expensive to compute when we have a large number of observations.
 For numeric attributes, however, we can easily approximate the value.
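The two cases (odd and even number of observations) can be sketched as a small Python function. The name `median` is chosen for the example; Python's standard `statistics.median` follows the same averaging convention for even-sized data.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values
    when the number of observations is even."""
    if not values:
        raise ValueError("median of an empty data set is undefined")
    ordered = sorted(values)  # sorting dominates the cost for large N
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(median(salaries))       # 54.0 (average of 52 and 56)
print(median(salaries[:11]))  # 52 (sixth of the first 11 values)
```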
MEDIAN
 Assume that the data are grouped in intervals according to their \(x_i\) data values and that the frequency (i.e., number of data values) of each interval is known.
 For example, employees may be grouped according to their annual salary in intervals such as $10,000–20,000, $20,000–30,000, and so on.
 Let the interval that contains the median frequency be the median interval.
 We can approximate the median of the entire data set (e.g., the median salary) by interpolation using the formula

\[ \text{median} = L_1 + \left( \frac{N/2 - \left(\sum \text{freq}\right)_l}{\text{freq}_{\text{median}}} \right) \times \text{width} \]

 where \(L_1\) is the lower boundary of the median interval,
 N is the number of values in the entire data set,
 \(\left(\sum \text{freq}\right)_l\) is the sum of the frequencies of all of the intervals that are lower than the median interval,
 \(\text{freq}_{\text{median}}\) is the frequency of the median interval, and
 width is the width of the median interval.
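The interpolation formula can be sketched in Python as follows. The function name `grouped_median` and the interval boundaries and frequencies are hypothetical, invented only to exercise the formula; they do not come from the slides.

```python
def grouped_median(intervals):
    """Approximate the median from (lower, upper, frequency) intervals,
    which must be sorted by lower boundary and must not overlap."""
    n = sum(freq for _, _, freq in intervals)
    if n == 0:
        raise ValueError("no observations")
    half = n / 2
    below = 0  # sum of frequencies of intervals below the current one
    for lower, upper, freq in intervals:
        if below + freq >= half:
            # This is the median interval: interpolate inside it.
            width = upper - lower
            return lower + (half - below) / freq * width
        below += freq
    raise ValueError("median interval not found")


# Hypothetical salary intervals in thousands of dollars with their counts.
salary_intervals = [
    (10, 20, 200),
    (20, 30, 450),
    (30, 40, 300),
    (40, 50, 50),
]

print(round(grouped_median(salary_intervals), 2))  # 26.67
```

Here N = 1000, so N/2 = 500 falls in the 20 to 30 interval; 200 values lie below it, giving 20 + (500 - 200) / 450 × 10 ≈ 26.67.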
MODE
 The mode is another measure of central tendency.
 The mode for a set of data is the value that occurs most frequently in
the set.
 Therefore, it can be determined for qualitative and quantitative
attributes.
 It is possible for the greatest frequency to correspond to several
different values, which results in more than one mode.
 Data sets with one, two, or three modes are respectively called
unimodal, bimodal, and trimodal.
 A data set with two or more modes is multimodal.
 If each data value occurs only once, then there is no mode.
 Suppose we have the following values for salary (in thousands of
dollars), shown in increasing order: 30, 36, 47, 50, 52, 52, 56, 60, 63,
70, 70, 110.
 The two modes are $52,000 and $70,000.
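A minimal sketch of finding all modes is given below. The function name `modes` is an assumption; it returns an empty list when every value occurs only once, following the rule stated above, and it works for qualitative as well as quantitative values.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs with the greatest frequency,
    or an empty list if each value occurs only once."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []  # each value occurs only once: no mode
    return sorted(value for value, count in counts.items() if count == highest)


salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(modes(salaries))  # [52, 70] -> the data set is bimodal

# The mode is also defined for qualitative attributes.
print(modes(["black", "brown", "brown", "red"]))  # ['brown']

print(modes([1, 2, 3]))  # [] -> no mode
```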
MIDRANGE
 The midrange can also be used to assess the central tendency of a
numeric data set.
 It is the average of the largest and smallest values in the set.
 This measure is easy to compute using the SQL aggregate functions,
max() and min().
 The midrange of the salary data above is (30,000 + 110,000) / 2 = $70,000.
 In a unimodal frequency curve with a perfectly symmetric data distribution, the mean, median, and mode are all at the same center value.
 Data in most real applications are not symmetric.
 They may instead be either positively skewed, where the mode occurs at a value that is smaller than the median, or negatively skewed, where the mode occurs at a value greater than the median.
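The midrange, and how it compares with the mean and median on the same data, can be sketched as follows. The function name `midrange` is chosen for the example; `statistics.mean` and `statistics.median` are from the Python standard library.

```python
import statistics


def midrange(values):
    """Average of the smallest and largest values in the data set."""
    if not values:
        raise ValueError("midrange of an empty data set is undefined")
    return (min(values) + max(values)) / 2


# Salary values in thousands of dollars from the examples in these slides.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

salary_midrange = midrange(salaries)
salary_mean = statistics.mean(salaries)
salary_median = statistics.median(salaries)

print(salary_midrange)  # 70.0
print(salary_mean)      # 58
print(salary_median)    # 54.0
```

The three measures disagree (70, 58, and 54), which is what one would expect for data that are not symmetric: the single high value of 110 pulls the midrange, and to a lesser extent the mean, above the median.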
Symmetric vs. Skewed Data

[Figure: median, mean, and mode of symmetric, positively skewed, and negatively skewed data]

February 23, 2015 Data Mining: Concepts and Techniques 23
