Dr. Mohamed H. Farrag
DATA MINING: Concepts and Techniques
Instructor: Dr. Mohamed H. Farrag    Course: Data Mining    Ch2: Getting to Know Your Data
Textbook
Main textbook:
Data Mining: Concepts and Techniques (3rd ed.)
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign & Simon Fraser University
Tan, Steinbach,
Karpatne, Kumar
Chapter 2 Learning Objectives
Getting to Know Your Data
• Attributes and Objects
• Types of Data
• Data Quality
• Similarity and Distance
• Data Preprocessing
Getting to Know Your Data
What is Data?
• Collection of data objects and their attributes
• An attribute is a property or characteristic of an object
—Examples: eye color of a person, temperature, etc.
—Attribute is also known as variable, field, characteristic,
dimension, or feature

Attributes (columns):
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
Properties of Attribute Values
• The type of an attribute depends on which of the following
properties/operations it possesses:
—Distinctness: = ≠
—Order: < >
—Differences are meaningful: + -
—Ratios are meaningful: * /
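
The four properties above are usually read as cumulative: each attribute type keeps the properties of the one before it and adds one more. A minimal Python sketch of that reading, assuming the type names nominal, ordinal, interval, and ratio; the names `PROPERTIES`, `LEVEL`, and `supported_properties` are illustrative, not from the slides:

```python
# Properties in the order listed on the slide.
PROPERTIES = ["distinctness", "order", "differences", "ratios"]

# Assumed mapping: each attribute type supports the first N properties.
LEVEL = {"nominal": 1, "ordinal": 2, "interval": 3, "ratio": 4}


def supported_properties(attribute_type):
    """Return the properties that are meaningful for an attribute type."""
    return PROPERTIES[:LEVEL[attribute_type]]


def supports(attribute_type, prop):
    """Check whether a single property is meaningful for an attribute type."""
    return prop in supported_properties(attribute_type)


if __name__ == "__main__":
    for name in LEVEL:
        print(name, supported_properties(name))
```

Under this reading, asking whether ratios are meaningful is the test that separates interval attributes from ratio attributes.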
Properties of Attribute Values
Attribute Type  Description                    Examples                          Operations
Nominal         Nominal attribute values only  zip codes, employee ID numbers,   mode, entropy, contingency
                distinguish. (=, ≠)            eye color, sex: {male, female}    correlation, χ² test
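
Two of the operations listed for nominal attributes, mode and entropy, need only equality comparisons between values. A small sketch in Python, assuming a made-up list of eye colors; the function names are illustrative:

```python
from collections import Counter
from math import log2


def mode(values):
    """Most frequent value; only needs = / != comparisons."""
    return Counter(values).most_common(1)[0][0]


def entropy(values):
    """Shannon entropy (in bits) of the value distribution."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())


if __name__ == "__main__":
    # Hypothetical nominal attribute values.
    eye_color = ["brown", "blue", "brown", "green", "brown", "blue"]
    print(mode(eye_color))
    print(round(entropy(eye_color), 3))
```

Neither function orders, subtracts, or divides the attribute values themselves, which is why both remain valid for nominal data.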
More Complicated Examples
• ID numbers
—Nominal, ordinal, or interval?
• Biased Scale
—Interval or Ratio?
Types of data sets
• Record
—Data Matrix
—Document Data
—Transaction Data
• Graph
—World Wide Web
—Molecular Structures
• Ordered
—Spatial Data
—Temporal Data
—Sequential Data
—Genetic Sequence Data
• Spatial, image and multimedia
—Spatial data: maps
—Image data:
—Video data:
Important Characteristics of Data
—Dimensionality
• number of attributes for the objects in the data set
• High dimensional data brings a number of challenges
• Curse of dimensionality (the difficulties associated with analyzing
high-dimensional data)
—Sparsity
• Most attributes of an object have values of 0
• Often, fewer than 1% of the entries are non-zero
—Resolution
• It is frequently possible to obtain data at different levels of
resolution
—Size
• Type of analysis may depend on size of data
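
Dimensionality and sparsity can both be measured directly on a data set. A rough Python sketch, assuming the data is a list of equal-length numeric rows; the tiny matrix is made up for illustration:

```python
def dimensionality(matrix):
    """Number of attributes (columns) per object."""
    return len(matrix[0]) if matrix else 0


def nonzero_fraction(matrix):
    """Fraction of entries that are non-zero; small values mean sparse data."""
    total = sum(len(row) for row in matrix)
    nonzero = sum(1 for row in matrix for value in row if value != 0)
    return nonzero / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical data set: 2 objects, 4 attributes.
    data = [
        [0, 0, 1, 0],
        [0, 2, 0, 0],
    ]
    print(dimensionality(data))    # number of attributes
    print(nonzero_fraction(data))  # share of non-zero entries
```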
Record Data
• Data that consists of a collection of records, each of
which consists of a fixed set of attributes
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
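
Because every record has the same fixed set of attributes, record data maps naturally onto a typed tuple. A sketch in Python using the table above; the class name `TaxRecord` and the field names are illustrative, and incomes are stored as the number before the "K":

```python
from typing import NamedTuple


class TaxRecord(NamedTuple):
    """One record: a fixed set of attributes, one value per attribute."""
    tid: int
    refund: bool
    marital_status: str
    taxable_income_k: int  # value as written in the table, e.g. 125 for "125K"
    cheat: bool


records = [
    TaxRecord(1, True, "Single", 125, False),
    TaxRecord(2, False, "Married", 100, False),
    TaxRecord(3, False, "Single", 70, False),
    TaxRecord(4, True, "Married", 120, False),
    TaxRecord(5, False, "Divorced", 95, True),
    TaxRecord(6, False, "Married", 60, False),
    TaxRecord(7, True, "Divorced", 220, False),
    TaxRecord(8, False, "Single", 85, True),
    TaxRecord(9, False, "Married", 75, False),
    TaxRecord(10, False, "Single", 90, True),
]

# Select the Tid of every record whose Cheat attribute is Yes.
cheat_tids = [r.tid for r in records if r.cheat]

if __name__ == "__main__":
    print(cheat_tids)
```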
Data Matrix
• If data objects have the same fixed set of numeric
attributes, then the data objects can be thought of as points
in a multi-dimensional space, where each dimension
represents a distinct attribute
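
Treating objects as points means geometric notions such as distance apply to them. A minimal Python sketch, assuming Euclidean distance (one possible choice, not one the slide prescribes) and a made-up 3-by-3 data matrix:

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)

# Hypothetical data matrix: each row is an object, each column an attribute.
data_matrix = [
    [1.0, 2.0, 3.0],
    [4.0, 6.0, 3.0],
    [1.0, 2.0, 5.0],
]


def pairwise_distances(matrix):
    """Distance between every pair of objects, as a square matrix."""
    return [[dist(a, b) for b in matrix] for a in matrix]


if __name__ == "__main__":
    for row in pairwise_distances(data_matrix):
        print(row)
```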
Document Data
• Each document becomes a 'term' vector
—Each term is a component (attribute) of the vector
—The value of each component is the number of times the
corresponding term occurs in the document.
[Table: term counts per document; one column per term, with legible column labels including "coach" and "score"]
Document 1   3  0  5  0  2  6  0  2  0  2
Document 2   0  7  0  2  1  0  0  3  0  0
Document 3   0  1  0  0  1  2  2  0  3  0
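
Turning a document into a term vector amounts to counting how often each vocabulary term occurs. A small Python sketch, assuming whitespace tokenization and lowercasing; the vocabulary and the sentence are made up for illustration:

```python
from collections import Counter


def term_vector(text, vocabulary):
    """Count occurrences of each vocabulary term in the text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


if __name__ == "__main__":
    # Hypothetical vocabulary and document.
    vocabulary = ["team", "coach", "score"]
    document = "The coach said the team would score and score again"
    print(term_vector(document, vocabulary))
```

Each position in the resulting list corresponds to one term, so documents of any length become vectors of the same size.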
Transaction Data
• A special type of record data, where
—Each record (transaction) involves a set of items.
—For example, consider a grocery store. The set of products
purchased by a customer during one shopping trip
constitute a transaction, while the individual products that
were purchased are the items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
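
Since each transaction is a set of items, Python sets are a natural fit. A sketch using the five transactions above; the helper name `containing` is illustrative:

```python
from collections import Counter

# The transactions from the table, keyed by transaction ID.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# How many transactions contain each item.
item_counts = Counter(item for items in transactions.values() for item in items)


def containing(itemset):
    """IDs of the transactions that contain every item in itemset."""
    return sorted(tid for tid, items in transactions.items() if itemset <= items)


if __name__ == "__main__":
    print(item_counts["Milk"])
    print(containing({"Diaper", "Milk"}))
```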
Graph Data
• Examples: Generic graph, a molecule, and webpages
[Figure: a web page of useful links, including a Knowledge Discovery and Data Mining bibliography, ACM SIGKDD, and KDnuggets]
Items/Events
(A B) (D) (C E)
(B D) (C) (E)
(C D) (B) (A E)
[Each parenthesized group is an element of the sequence]
Ordered Data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Ordered Data
• Spatio-Temporal Data
[Figure: average monthly temperature of land and ocean (Jan)]
Data Quality
• Poor data quality negatively affects many data processing
efforts
—"Poor data quality costs the typical company at least ten percent
(10%) of revenue; twenty percent (20%) is probably a better
estimate."
Data Quality
• What kinds of data quality problems?
• How can we detect problems with the data?
• What can we do about these problems?
Noise
• For objects, noise is an extraneous object
• For attributes, noise refers to modification of original values
—Examples: distortion of a person's voice when talking on a poor phone
and "snow" on television screen
[Figure: two plots against Time (seconds)]
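
Attribute noise can be imitated by perturbing clean values. A rough Python sketch, assuming additive Gaussian noise on a sine wave; the noise model, the 5 Hz frequency, and the name `add_noise` are all illustrative assumptions, not details from the slide:

```python
import math
import random


def add_noise(values, sigma, seed=0):
    """Return a copy of values with Gaussian noise (std. dev. sigma) added."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    return [v + rng.gauss(0.0, sigma) for v in values]


# One second of a hypothetical 5 Hz sine wave, sampled 100 times.
times = [i / 100 for i in range(100)]
signal = [math.sin(2 * math.pi * 5 * t) for t in times]
noisy = add_noise(signal, sigma=0.3)

if __name__ == "__main__":
    # Largest change that the noise made to any original value.
    print(max(abs(n - s) for n, s in zip(noisy, signal)))
```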
Missing Values
• Reasons for missing values
- Information is not collected
(e.g., people decline to give their age and weight)
- Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)
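
A first step toward detecting missing values is to count them per attribute. A minimal Python sketch, assuming records are dictionaries and a missing value is stored as `None`; the records and attribute names are made up:

```python
def missing_counts(records):
    """Count, per attribute, how many records have no value (None)."""
    counts = {}
    for record in records:
        for attribute, value in record.items():
            counts[attribute] = counts.get(attribute, 0) + (value is None)
    return counts


if __name__ == "__main__":
    # Hypothetical records: one person declined to give age and weight,
    # and annual income does not apply to the child.
    people = [
        {"age": 34, "weight": 70.5, "annual_income": 52000},
        {"age": None, "weight": None, "annual_income": 61000},
        {"age": 9, "weight": 30.0, "annual_income": None},
    ]
    print(missing_counts(people))
```

A count alone does not tell which of the two reasons applies; that distinction usually needs knowledge of how the data was collected.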
Duplicate Data
• Data set may include data objects that are duplicates,
or almost duplicates of one another
—Major issue when merging data from heterogeneous
sources
• Examples:
—Same person with multiple email addresses
• Data cleaning
—Process of dealing with duplicate data issues
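
The email example can be sketched as a grouping step: records that agree on a normalized name but differ in email are flagged as possible duplicates. This Python sketch assumes, unrealistically, that a name identifies a person; the records and function names are made up:

```python
from collections import defaultdict


def normalize(name):
    """Lowercase and collapse whitespace so trivial variants match."""
    return " ".join(name.lower().split())


def possible_duplicates(records):
    """Map each normalized name to its emails, keeping names seen more than once."""
    groups = defaultdict(list)
    for name, email in records:
        groups[normalize(name)].append(email)
    return {name: emails for name, emails in groups.items() if len(emails) > 1}


if __name__ == "__main__":
    # Hypothetical (name, email) records merged from two sources.
    merged = [
        ("Ada Lovelace", "ada@example.com"),
        ("ada  lovelace", "a.lovelace@example.org"),
        ("Alan Turing", "alan@example.com"),
    ]
    print(possible_duplicates(merged))
```

Flagged groups are candidates only; two different people can share a name, so a real cleaning step would compare further attributes before merging.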
Data Quality: Why Preprocess the Data?
Summary
• Data attribute types: nominal, binary, ordinal, interval-scaled, ratio-scaled
• Many types of data sets, e.g., numerical, text, graph, Web, image.
• Gain insight into the data by:
—Basic statistical data description: central tendency, dispersion, graphical
displays
—Data visualization: map data onto graphical primitives
—Measure data similarity
• The above steps are the beginning of data preprocessing.
• Many methods have been developed, but this is still an active area of research.