
Data Mining
Chapter 2: Getting to Know Your Data

Instructor: Dr. Mohamed H. Farrag
Course: Data Mining

Textbooks
• Main textbook: Data Mining: Concepts and Techniques (3rd ed.)
  Jiawei Han, Micheline Kamber, and Jian Pei
  University of Illinois at Urbana-Champaign & Simon Fraser University
• Introduction to Data Mining, 2nd Edition
  Tan, Steinbach, Karpatne, Kumar

Modified for Introduction to Data Mining by Dr. Mohamed H. Farrag

Chapter 2 Learning Objectives
Getting to Know Your Data
• Attributes and Objects
• Types of Data
• Data Quality
• Similarity and Distance
• Data Preprocessing

What is Data?
• Collection of data objects and their attributes
• An attribute is a property or characteristic of an object
—Examples: eye color of a person, temperature, etc.
—An attribute is also known as a variable, field, characteristic, dimension, or feature
• A collection of attributes describes an object
—An object is also known as a record, point, case, sample, entity, or instance

Example (columns are attributes, rows are objects):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Types of Attributes
• Nominal. Examples: ID numbers, eye color, zip codes, hair color = {auburn, black, blond, brown, grey, red, white}, marital status
• Ordinal. Examples: grades, height {tall, medium, short}
—Values have a meaningful order (ranking)
—Size = {small, medium, large}, grades, army rankings
• Interval. Examples: calendar dates, temperatures in Celsius or Fahrenheit
—Measured on a scale of equal-sized units
—Values have order
—E.g., temperature in °C or °F, calendar dates
—No true zero-point
• Ratio. Examples: temperature in Kelvin, length, time, counts, and monetary quantities
—Inherent zero-point
—We can speak of values as being an order of magnitude larger than the unit of measurement (10 K is twice as high as 5 K)
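The difference between an interval scale and a ratio scale can be checked with a little arithmetic. The sketch below is only an illustration: the helper name `kelvin_to_celsius` and the variable names are ours, not from the slides. It shows that the ratio 10 K / 5 K is 2, while the same two temperatures expressed in Celsius give a ratio with no physical meaning, even though the difference between them is the same on both scales.

```python
def kelvin_to_celsius(k: float) -> float:
    """Convert Kelvin (ratio scale) to Celsius (interval scale)."""
    return k - 273.15


a_k, b_k = 10.0, 5.0

# Kelvin has a true zero, so the ratio of two values is meaningful.
ratio_kelvin = a_k / b_k

# The zero of the Celsius scale is arbitrary, so this ratio says nothing
# about "twice as hot".
a_c, b_c = kelvin_to_celsius(a_k), kelvin_to_celsius(b_k)
ratio_celsius = a_c / b_c

# Differences are meaningful on both scales and agree (up to float rounding).
diff_kelvin = a_k - b_k
diff_celsius = a_c - b_c
```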

Properties of Attribute Values
• The type of an attribute depends on which of the following properties/operations it possesses:
—Distinctness: = ≠
—Order: < >
—Differences are meaningful: + -
—Ratios are meaningful: * /

- Nominal attribute: distinctness
- Ordinal attribute: distinctness & order
- Interval attribute: distinctness, order & meaningful differences
- Ratio attribute: all 4 properties/operations
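Because each attribute type adds one property to the previous type, the hierarchy can be written as a small lookup. This is a minimal sketch; the names `AttributeType`, `PROPERTIES`, and `properties_of` are hypothetical and chosen only for illustration.

```python
from enum import IntEnum


class AttributeType(IntEnum):
    """Attribute types, numbered by how many properties they possess."""
    NOMINAL = 1
    ORDINAL = 2
    INTERVAL = 3
    RATIO = 4


# Properties in the order in which the types acquire them.
PROPERTIES = ["distinctness", "order", "differences", "ratios"]


def properties_of(attr_type: AttributeType) -> list[str]:
    """Return the properties that are meaningful for the given type."""
    return PROPERTIES[: int(attr_type)]
```

For example, `properties_of(AttributeType.INTERVAL)` returns the first three properties and leaves out ratios.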

Properties of Attribute Values

Attribute Type | Description | Examples | Operations
Nominal | Nominal attribute values only distinguish. (=, ≠) | zip codes, employee ID numbers, eye color, sex: {male, female} | mode, entropy, contingency correlation, χ² test
Ordinal | Ordinal attribute values also order objects. (<, >) | hardness of minerals, {good, better, best}, grades, street numbers | median, percentiles, rank correlation, run tests, sign tests
Interval | For interval attributes, differences between values are meaningful. (+, -) | calendar dates, temperature in Celsius or Fahrenheit | mean, standard deviation, Pearson's correlation, t and F tests
Ratio | For ratio variables, both differences and ratios are meaningful. (*, /) | temperature in Kelvin, monetary quantities, counts, age, mass, length, current | geometric mean, harmonic mean, percent variation
This categorization of attributes is due to S. S. Stevens
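As a rough illustration of the "Operations" column, the sketch below applies one summary statistic per attribute type using Python's standard `statistics` module. The sample values and variable names are invented for the example and do not come from the slides.

```python
import statistics

eye_color = ["brown", "blue", "brown", "green", "brown"]   # nominal
grade_order = ["good", "better", "best"]                   # ordinal scale, low to high
grades = ["good", "best", "better", "good", "better"]      # ordinal
temps_c = [20.0, 22.0, 19.0, 23.0]                         # interval
lengths_m = [1.0, 4.0, 16.0]                               # ratio

# Nominal: only counting distinct values makes sense, so use the mode.
mode_color = statistics.mode(eye_color)

# Ordinal: map each value to its rank, then take a median that is an actual rank.
ranks = [grade_order.index(g) for g in grades]
median_grade = grade_order[statistics.median_low(ranks)]

# Interval: differences are meaningful, so mean and standard deviation apply.
mean_temp = statistics.mean(temps_c)
stdev_temp = statistics.stdev(temps_c)

# Ratio: ratios are meaningful, so the geometric mean applies as well.
geo_mean_length = statistics.geometric_mean(lengths_m)
```

Each statistic also remains valid for the types further down the table: a ratio attribute supports the mode, median, and mean in addition to the geometric mean.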
Properties of Attribute Values

Attribute Type | Transformation | Comments
Nominal | Any permutation of values | If all employee ID numbers were reassigned, would it make any difference?
Ordinal | An order-preserving change of values, i.e., new_value = f(old_value), where f is a monotonic function | An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.
Interval | new_value = a * old_value + b, where a and b are constants | Thus, the Fahrenheit and Celsius temperature scales differ in terms of where their zero value is and the size of a unit (degree).
Ratio | new_value = a * old_value | Length can be measured in meters or feet.
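The permissible transformations can be checked numerically. The sketch below is an illustration under our own naming (`celsius_to_fahrenheit`, `meters_to_feet`, `same_order` are hypothetical helpers). It shows that a monotonic map keeps the order of ordinal values, that a linear map a * x + b keeps ratios of differences but not ratios of values, and that a pure scaling a * x keeps ratios of values.

```python
def celsius_to_fahrenheit(c: float) -> float:
    """Interval-scale transformation: new_value = a * old_value + b."""
    return 1.8 * c + 32.0


def meters_to_feet(m: float) -> float:
    """Ratio-scale transformation: new_value = a * old_value."""
    return 3.28084 * m


def same_order(xs, ys) -> bool:
    """True if both sequences rank their positions identically."""
    def ranking(values):
        return sorted(range(len(values)), key=values.__getitem__)
    return ranking(xs) == ranking(ys)


# Ordinal: {1, 2, 3} and {0.5, 1, 10} encode the same ordering.
order_preserved = same_order([1, 2, 3], [0.5, 1, 10])

# Interval: ratios of differences survive, ratios of values do not.
c = [0.0, 10.0, 30.0]
f = [celsius_to_fahrenheit(x) for x in c]
diff_ratio_c = (c[2] - c[1]) / (c[1] - c[0])
diff_ratio_f = (f[2] - f[1]) / (f[1] - f[0])
value_ratio_c = c[2] / c[1]
value_ratio_f = f[2] / f[1]

# Ratio: ratios of values survive a change of unit.
m = [2.0, 6.0]
ft = [meters_to_feet(x) for x in m]
length_ratio_m = m[1] / m[0]
length_ratio_ft = ft[1] / ft[0]
```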

Discrete and Continuous Attributes
• Discrete Attribute
—Has only a finite or countably infinite set of values
—Examples: zip codes, counts, or the set of words in a
collection of documents
—Often represented as integer variables.
—Note: binary attributes are a special case of discrete
attributes
• Continuous Attribute
—Has real numbers as attribute values
—Examples: temperature, height, or weight.
—Practically, real values can only be measured and
represented using a finite number of digits.
—Continuous attributes are typically represented as floating-
point variables.

More Complicated Examples
• ID numbers
—Nominal, ordinal, or interval?
• Number of cylinders in an automobile engine
—Nominal, ordinal, or ratio?
• Biased scale
—Interval or ratio?

Types of data sets
• Record
—Data Matrix
—Document Data
—Transaction Data
• Graph
—World Wide Web
—Molecular Structures
• Ordered
—Spatial Data
—Temporal Data
—Sequential Data
—Genetic Sequence Data
• Spatial, image, and multimedia
—Spatial data: maps
—Image data
—Video data

Important Characteristics of Data
—Dimensionality
• Number of attributes for the objects in the data set
• High-dimensional data brings a number of challenges
• Curse of dimensionality (the difficulties associated with analyzing high-dimensional data)
—Sparsity
• Most attributes of an object have values of 0
• In many cases, fewer than 1% of the entries are non-zero
—Resolution
• It is frequently possible to obtain data at different levels of resolution
—Size
• Type of analysis may depend on the size of the data
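Dimensionality and sparsity are easy to measure once a data set is stored as rows of values. The sketch below uses a small count matrix as sample data; the function names `dimensionality` and `nonzero_fraction` are our own. A real sparse data set would have a far lower non-zero fraction than this toy example.

```python
# Three objects (rows) described by ten count attributes (columns).
matrix = [
    [3, 0, 5, 0, 2, 6, 0, 2, 0, 2],
    [0, 7, 0, 2, 1, 0, 0, 3, 0, 0],
    [0, 1, 0, 0, 1, 2, 2, 0, 3, 0],
]


def dimensionality(rows) -> int:
    """Number of attributes per object."""
    return len(rows[0]) if rows else 0


def nonzero_fraction(rows) -> float:
    """Fraction of entries that are non-zero (low values mean sparse data)."""
    total = sum(len(row) for row in rows)
    nonzero = sum(1 for row in rows for value in row if value != 0)
    return nonzero / total if total else 0.0


n_attributes = dimensionality(matrix)
density = nonzero_fraction(matrix)
```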

Record Data
• Data that consists of a collection of records, each of which consists of a fixed set of attributes

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Data Matrix
• If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute
• Such a data set can be represented by an m by n matrix, where there are m rows (observations), one for each object, and n columns, one for each attribute

Projection of x load  Projection of y load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1
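Treating each row of a data matrix as a point makes geometric operations available. The sketch below stores the two rows shown above as a list of lists and computes the Euclidean distance between them with `math.dist` (Python 3.8+). Euclidean distance is our choice for the illustration; this slide does not name a particular distance.

```python
import math

# m = 2 objects (rows), n = 5 numeric attributes (columns).
data_matrix = [
    [10.23, 5.27, 15.22, 2.7, 1.2],
    [12.65, 6.25, 16.22, 2.2, 1.1],
]

m = len(data_matrix)       # number of objects
n = len(data_matrix[0])    # number of attributes

# Each row is a point in n-dimensional space.
distance = math.dist(data_matrix[0], data_matrix[1])
```

In practice the attributes would usually be rescaled first, since attributes with larger numeric ranges dominate an unscaled distance.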

Document Data
• Each document becomes a 'term' vector
—Each term is a component (attribute) of the vector
—The value of each component is the number of times the corresponding term occurs in the document

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1  3     0      5     0     2      6     0    2     0        2
Document 2  0     7      0     2     1      0     0    3     0        0
Document 3  0     1      0     0     1      2     2    0     3        0
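A term vector can be built by counting words against a fixed vocabulary. The sketch below is a simplified illustration: the vocabulary, the sample sentence, and the function name `term_vector` are assumptions, and the tokenization is plain whitespace splitting with no stemming or punctuation handling.

```python
from collections import Counter


def term_vector(text: str, vocabulary: list[str]) -> list[int]:
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


vocabulary = ["team", "coach", "play", "ball", "score"]
document = "The coach told the team to play and the team did score"

vector = term_vector(document, vocabulary)
```

Here `vector` has one component per vocabulary term, in vocabulary order, so documents of different lengths all map to vectors of the same dimension.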

Transaction Data
• A special type of record data, where
—Each record (transaction) involves a set of items.
—For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
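Since a transaction is a set of items, Python sets are a natural representation. The sketch below encodes the five transactions from the table and counts how many transactions contain a given item or group of items. The names `transactions`, `item_counts`, and `count_containing` are ours.

```python
from collections import Counter

# Transaction ID -> set of purchased items (from the table above).
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# Number of transactions in which each individual item appears.
item_counts = Counter(item for items in transactions.values() for item in items)


def count_containing(itemset: set) -> int:
    """Number of transactions that contain every item in itemset."""
    return sum(1 for items in transactions.values() if itemset <= items)


diaper_and_milk = count_containing({"Diaper", "Milk"})
```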

Graph Data
• Examples: generic graph, a molecule, and webpages

[Figure: a set of linked web pages (a knowledge discovery and data mining bibliography site)]
[Figure: benzene molecule, C6H6]

Ordered Data
• Sequences of transactions

Items/Events
(A B) (D) (C E)
(B D) (C) (E)
(C D) (B) (A E)

Each parenthesized group is an element of the sequence.

Ordered Data
• Genomic sequence data

GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG

Ordered Data

• Spatio-Temporal Data

[Figure: average monthly temperature of land and ocean (January)]

Data Quality
• Poor data quality negatively affects many data processing efforts

"Poor data quality costs the typical company at least ten percent (10%) of revenue; twenty percent (20%) is probably a better estimate."
Thomas C. Redman, DM Review, August 2004

• Data mining example: a classification model for detecting people who are loan risks is built using poor data
—Some credit-worthy candidates are denied loans
—More loans are given to individuals that default

Data Quality
• What kinds of data quality problems?
• How can we detect problems with the data?
• What can we do about these problems?

• Examples of data quality problems:
- Noise and outliers
- Missing values
- Duplicate data
- Wrong data

Noise
• For objects, noise is an extraneous object
• For attributes, noise refers to modification of original values
—Examples: distortion of a person's voice when talking on a poor phone and "snow" on a television screen

[Figure: two panels plotted against time (seconds): "Two Sine Waves" and "Two Sine Waves + Noise"]
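The effect shown in the figure can be reproduced in a few lines. The sketch below is only an approximation of the plot: the frequencies (3 Hz and 7 Hz), the Gaussian noise model, its standard deviation, and the seed are all assumptions, since the slide does not state how the noise was generated.

```python
import math
import random


def two_sine_waves(t: float, f1: float = 3.0, f2: float = 7.0) -> float:
    """Sum of two sine waves at time t (frequencies in Hz are assumed)."""
    return math.sin(2 * math.pi * f1 * t) + math.sin(2 * math.pi * f2 * t)


def add_noise(signal, sigma: float = 0.5, seed: int = 0):
    """Return a copy of signal with Gaussian noise added to every sample."""
    rng = random.Random(seed)  # fixed seed keeps the example reproducible
    return [x + rng.gauss(0.0, sigma) for x in signal]


times = [i / 100 for i in range(101)]          # one second, 101 samples
clean = [two_sine_waves(t) for t in times]     # original attribute values
noisy = add_noise(clean)                       # modified (noisy) values
```

Plotting `clean` and `noisy` against `times` gives two panels comparable to the ones described above.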


Outliers
• Outliers are data objects with characteristics that are considerably different than most of the other data objects in the data set
- Case 1: Outliers are noise that interferes with data analysis
- Case 2: Outliers are the goal of our analysis
• Credit card fraud
• Intrusion detection
• Causes?
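The slides do not prescribe a way to find outliers. As one possible illustration for a single numeric attribute, the sketch below flags values whose distance from the median is large relative to the median absolute deviation (MAD). The function name `mad_outliers`, the scaling constant 0.6745, and the threshold 3.5 are conventional choices that we assume here, not part of the course material.

```python
import statistics


def mad_outliers(values, threshold: float = 3.5):
    """Return the values whose modified z-score exceeds the threshold."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        # All typical values coincide; this simple rule cannot rank deviations.
        return []
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


amounts = [10, 12, 11, 13, 12, 11, 95]
flagged = mad_outliers(amounts)
```

Whether a flagged value is noise to discard (case 1) or the thing being looked for, as in fraud detection (case 2), depends on the application.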

Missing Values
• Reasons for missing values
- Information is not collected
(e.g., people decline to give their age and weight)
- Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)

• Handling missing values


- Eliminate data objects or variables
- Estimate missing values
• Example: time series of temperature
• Example: census results
- Ignore the missing value during analysis
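The three handling strategies can be sketched on a short temperature time series in which missing readings are stored as `None`. The data and the helper names are invented for illustration, and the estimation step is deliberately simple: it fills a gap only when both neighbouring readings are present, using their average.

```python
temps = [20.0, None, 22.0, 23.0, None, 25.0]


def drop_missing(values):
    """Strategy 1: eliminate the objects that have a missing value."""
    return [v for v in values if v is not None]


def interpolate(values):
    """Strategy 2: estimate a missing value from its two neighbours."""
    filled = list(values)
    for i, v in enumerate(values):
        has_neighbours = 0 < i < len(values) - 1
        if v is None and has_neighbours:
            left, right = values[i - 1], values[i + 1]
            if left is not None and right is not None:
                filled[i] = (left + right) / 2
    return filled


def mean_ignoring_missing(values):
    """Strategy 3: ignore missing values while computing a statistic."""
    present = drop_missing(values)
    return sum(present) / len(present)


kept = drop_missing(temps)
estimated = interpolate(temps)
average = mean_ignoring_missing(temps)
```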

Duplicate Data
• Data set may include data objects that are duplicates,
or almost duplicates of one another
—Major issue when merging data from heterogeneous
sources

• Examples:
—Same person with multiple email addresses

• Data cleaning
—Process of dealing with duplicate data issues
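A minimal sketch of one cleaning step for the example above: records that refer to the same person are grouped after normalizing the name, and their email addresses are merged. The records, field names, and the `normalize` rule (lowercase, collapsed whitespace) are assumptions. Real record matching is harder, because different people can share a name and the same person can appear under different spellings.

```python
from collections import defaultdict

records = [
    {"name": "Alice Smith", "email": "alice@example.com"},
    {"name": "alice  smith ", "email": "a.smith@example.org"},
    {"name": "Bob Jones", "email": "bob@example.com"},
]


def normalize(name: str) -> str:
    """Lowercase the name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def merge_duplicates(rows):
    """Group rows by normalized name and collect each person's emails."""
    merged = defaultdict(set)
    for row in rows:
        merged[normalize(row["name"])].add(row["email"])
    return {name: sorted(emails) for name, emails in merged.items()}


people = merge_duplicates(records)
```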

Data Quality: Why Preprocess the Data?

• Measures for data quality: a multidimensional view
—Accuracy: correct or wrong, accurate or not
—Completeness: not recorded, unavailable
—Consistency: some modified but some not, dangling
—Timeliness: timely update?
—Believability: how much can the data be trusted to be correct?
—Interpretability: how easily can the data be understood?

Summary
• Data attribute types: nominal, binary, ordinal, interval-scaled, ratio-scaled
• Many types of data sets, e.g., numerical, text, graph, Web, image
• Gain insight into the data by:
—Basic statistical data description: central tendency, dispersion, graphical displays
—Data visualization: map data onto graphical primitives
—Measure data similarity
• The above steps are the beginning of data preprocessing
• Many methods have been developed, but this is still an active area of research
