Data Mining Techniques & Applications

This document provides an overview of data mining techniques and applications, including different types of data sets. It discusses how a data set typically consists of a collection of data objects or instances that are characterized by attributes and measurements. Common types of data sets include tables of records, transaction databases, data matrices, graph-based data, ordered/sequential data, and spatial data. Understanding the different data types is important for correctly applying data mining techniques and exploring and summarizing data.

Uploaded by

AzerMušinović
Data Mining

Techniques & Applications

Processing & Data Exploration


Topics
 Data and Data sets
 Data Types
 Data Sources and Data Quality
 Data Pre-processing Methods
 Data Summarization
 Data Visualization
 Pattern Visualization
 Overview of OLAP
Sarajevo School of Science and Technology 2
Data
 Instance/Example/Object
◼ The input data set is normally considered as a collection of data objects or
instances or examples.
◼ Each instance is an individual and independent record of real life events.
 Attribute and Measurement
◼ A data instance is characterized by its values on a fixed, pre-defined set of features
or attributes.
◼ An attribute describes a specific property or characteristic of a data object.
◼ Measurement is a process of assigning a valid value to an attribute according to a
measurement scale.
◼ Use of appropriate measurement scales is extremely important for the correct
understanding of the attribute.
 Ex. <1002, “J.Smith”, 23, “Milton Keynes”, £230k, 20/2/95, on time>



Types of Data Sets
 Data set of records
◼ Table of records (single table or joining of multiple tables)

Age Group     Own Car   Income Band   Class
young         yes       low           risky
young         no        low           risky
middle aged   yes       middle        risky
middle aged   no        high          safe
middle aged   yes       low           risky
young         yes       high          risky
middle aged   no        low           safe
retired       yes       middle        safe
retired       no        middle        safe
retired       yes       high          safe

◼ Transaction database (transaction ID and set of items)

TID   Items
100   apple, beer, newspaper
200   apple, beef, beer, newspaper, potato
300   beef, potato
400   beef, noodles
500   beef, potato

◼ Data matrix (table with numeric attributes)

Object Count   Diameter   Area     Shape Factor   Intensity   x        y
34             15,19      181,14   0,99           655,58      249,74   242,32
41             16,07      202,83   0,9            953,22      241,63   244,51
59             19,98      313,46   0,88           688,95      239,67   248,89
44             18,14      258,49   0,98           786,62      225,83   243,87
30             17,42      238,31   0,96           501,1       405,88   66,2
30             15,59      190,93   1,01           647,18      237,58   247,98
32             15,16      180,49   0,99           624,84      233,26   263,12

◼ Document-term matrix

DocumentID   match   coach   game   play   win
1            10      2       2      2      3
2            2       1       0      0      2
3            0       34      5      10     10
4            4       0       1      2      2


Types of Data Sets
 Graph-based data
◼ Using a graph to represent the relationships between data
objects
◼ The structure of a data object is itself represented as a graph


Types of Data Sets
 Ordered data
◼ Sequential (temporal) data: i.e. record + time tag

CID   Items
1     (t1: apple, milk), (t4: newspaper)
2     (t1: apple, beef), (t3: milk, newspaper)
3     (t2: beef, potato)
4     (t4: beef, noodles)

◼ Sequence data: an ordered sequence of entities without time tag
Ex. GGTTCCGCCTTCAGCCC CGCGCCCGCAGGG…
◼ Time series data: each record is a time series of measurements

         Foci progress measured at
Region   0.5 h   1 h     2 h    6 h
1        72.30   33.98   30.7   10.2
2        65      32.5    26.4   12.5
3        67.8    34.3    22.1   8.4

◼ Spatial data


Types of Data Sets
 The most commonly used form of data set is the record set
 A record may directly capture the raw record data or the
extracted features of non-record raw data
 Ex. Instead of working directly with a skin image, we measure the
number of cells, the thickness of the dermis layer and the
thickness of the fat layer, and then label the record as
“diabetic”.

[Figure: skin image with the fat layer, the dermis layer and the cells labelled]


General Characteristics of Data Sets
 Size
◼ In terms of the total number of records
◼ Small (MB), medium (GB) and large (TB)
 Dimensionality
◼ Varies from data set to data set, from low to extremely high
◼ Curse of dimensionality
 Sparsity
◼ Values are skewed to some extreme or sub-ranges
◼ Asymmetric (some values are more important than others)
◼ May be useful for speedy processing
 Resolution
◼ Right level of data details
◼ Related to the intended purpose
◼ Not necessarily “the more details the better”



Data Types
 Ideally, the domain features must reflect the
properties of the attribute.
◼ Ex. attribute Age is an integer between 0 and 130
 However, this is not always fully observed in
practice.
◼ Ex. The EmpID attribute is usually declared as an integer
type. Integers 1001 and 1020 represent two
different employees. However, adding
the two numbers or taking their average would not
make sense.


Data Types
 Different types:
◼ Categorical/Qualitative types (Nominal,
Ordinal)
◼ Numeric/Quantitative types (Interval, Ratio)
◼ Discrete vs. continuous attributes
◼ Domain nature, applicable operations, and
transformation



Data Types: Nominal
 A set of names, no concept of order and difference
◼ Ex. attribute Name: {“Jones”, “Smith”, “Wang”, “Richardson”}
 Only enough information to distinguish one object from
another
 Operators applicable: =, ≠
◼ Ex. SELECT * FROM Customers WHERE name = 'Richardson'
 Caution: nominal attributes declared as numbers should not be
treated as numbers. Ex. EmpID.
 A one-to-one transformation is permissible
◼ Ex. Gender:{“Female”, “Male”} => {F, M} => {0, 1}
◼ Ex. EmpID :[1, 10] => EmpID:{A, B, C, D, E, F, G, H, X, S}



Data Types: Ordinal
 A set of names with concept of order, without concept of
difference
◼ Ex. Temperature: {cold, warm, hot}
◼ Ex. Grades: {F, D, C, B, A}
 Operators applicable: =, ≠, <, >, ≤, ≥
◼ Ex. cold < warm < hot
 We only know there is a difference between two values,
but we do not know by how much
◼ Ex. Hot – warm = warm – cold? Hot = 2 * warm?
 An order-preserving transformation of values is permitted
◼ Ex. Customer status: {bad, average, good, excellent} =>
Customer status: {1, 2, 3, 4}



Data Types: Interval
 A set of numeric values with concepts of order and difference
◼ Ex. Calendar year: 1945, 1970, 2006
◼ Ex. Temperature scales °F and °C
 Operators applicable: all operators for the ordinal type plus +, -
◼ Ex. Calendar year: 1945 < 1970, 2007 – 1959 = 48
 There is no true (absolute) zero, so operations such as * and /
are not meaningful
◼ Ex. Is 20°C twice as warm as 10°C?
◼ Ex. Does 2007 * 2 really make sense?
 The transformation of new_value = a * old_value + b where a
and b are constants is permitted
◼ Ex. Temperature: °C = (5/9) * (°F – 32)
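
The permitted transformation can be checked numerically. The following is a minimal Python sketch, not part of the original slides, and its function and variable names are illustrative. Under the stated conversion formula it suggests that moving between °C and °F keeps the order of values and the ratio of differences, but not the ratio of the values themselves, which is why * and / are not meaningful on an interval scale.

```python
def celsius_to_fahrenheit(c: float) -> float:
    """Interval-scale transformation new_value = a * old_value + b (a = 9/5, b = 32)."""
    return (9 / 5) * c + 32


def fahrenheit_to_celsius(f: float) -> float:
    """Inverse transformation, as given on the slide: C = (5/9) * (F - 32)."""
    return (5 / 9) * (f - 32)


c_cold, c_mild, c_warm = 10.0, 20.0, 40.0
f_cold, f_mild, f_warm = (celsius_to_fahrenheit(c) for c in (c_cold, c_mild, c_warm))

# Order survives the transformation.
order_preserved = (c_cold < c_mild < c_warm) and (f_cold < f_mild < f_warm)

# Ratios of differences survive: (40 - 20) / (20 - 10) is 2 on both scales.
diff_ratio_c = (c_warm - c_mild) / (c_mild - c_cold)
diff_ratio_f = (f_warm - f_mild) / (f_mild - f_cold)

# Ratios of values do not survive: 20 / 10 is 2 in Celsius, but the same
# two temperatures expressed in Fahrenheit give a different ratio.
value_ratio_c = c_mild / c_cold
value_ratio_f = f_mild / f_cold

print(order_preserved, diff_ratio_c, diff_ratio_f, value_ratio_c, value_ratio_f)
```

So "20°C is twice as warm as 10°C" is a statement about the chosen scale, not about temperature.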



Data Types: Ratio
 A set of values with concepts of order, difference and ratio
 The set has an absolute zero
 Operators applicable: all operators for the interval type plus
*, /
 The transformation of new_value = a * old_value where a
is a constant is permitted. Ex. Converting meters to feet.
 Examples:
◼ The difference between my age (30) and my niece’s age (10) is
20 years. My age is 3 times my niece’s.
◼ If the distance between Tuzla and Sarajevo is 130 km, and the
distance between Mostar and Tuzla is 260 km, then the distance between
MO and TZ is twice the distance between TZ and SA.



Data Source
 Operational databases (primarily for
tactical decision making)
◼ Single table
◼ Inner/Outer joins of a number of related
tables
 External sources
◼ From partners or third parties
◼ Combined with internal data source
Data Source
 Data warehouse
◼ An organizational database for decision making
◼ A central data repository, separate from the
operational systems, into which data from the
operational systems are uploaded and integrated
◼ Organization wide data consistency and data
integration
◼ Data details as well as data summarization
◼ Equipped with data analysis and reporting tools
◼ Refer to later parts of this course
Data Quality
 Data quality is an important issue for information-based decision making,
but is largely ignored by many organisations
 “Garbage-in, garbage-out”
 Evaluating data quality
◼ Accuracy
◼ Correctness
◼ Completeness
◼ Consistency
◼ Redundancy
 For data mining, “addressing the quality issue at source” cannot always be
expected, so quality is also addressed
◼ By data cleaning
◼ By using tolerant mining algorithms
 Quality is relevant to the intended purpose of data mining



Data Source and Data Quality
 Measurement and Data Collection Issues w.r.t Quality
◼ Noise
 Random measurement error, or the addition of spurious objects
 Often associated with data with spatial and temporal properties
◼ Precision, bias and accuracy of measurements (when a number
of measurements are repeated)
 Precision: the closeness of the measurements to one another,
measured by the standard deviation of the measurements
 Bias: a systematic variation of measurements from the quantity
being measured, measured against the known external value
 Accuracy: the closeness of the measure to the true value, normally
indicated by the number of significant digits of the measurements
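
The definitions of precision and bias above can be illustrated with a short computation. This is a minimal Python sketch under stated assumptions: the five weighings and the 1.000 g reference value are made-up illustrative data, and the function names are hypothetical.

```python
import statistics


def precision(measurements):
    """Precision: sample standard deviation of the repeated measurements."""
    return statistics.stdev(measurements)


def bias(measurements, true_value):
    """Bias: mean of the measurements minus the known external value."""
    return statistics.mean(measurements) - true_value


# Illustrative data (assumption): five weighings of a 1.000 g reference mass.
weights = [1.015, 0.990, 1.013, 1.001, 0.986]
true_weight = 1.000

p = precision(weights)
b = bias(weights, true_weight)
print(f"precision = {p:.4f} g, bias = {b:.4f} g")
```

A small standard deviation means the measurements agree with one another; a bias near zero means they are centred on the true value. The two are independent: a scale can be precise yet biased.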



Data Source and Data Quality
 Measurement and Data Collection Issues
w.r.t Quality
◼ Outliers
 Different from most values in the data set or
 Unusual with respect to the typical values
◼ Missing values (Null value)
 Not measured or Not available
◼ Inconsistency (within data, between data and
overall)
Data Source and Data Quality
 Application Issues w.r.t Quality
◼ Timeliness
 Data have a limited period of time of validity
 Patterns drawn from old data may not be applicable to
the current situation
◼ Relevance
 Selection of attributes for data mining is related to the
purpose of mining
 If attributes are not selected carefully, the mined patterns may
miss important attributes
 Not all attributes are relevant to the mining task



Data Source and Data Quality
 Application Issues w.r.t Quality
◼ Sampling bias
 Sampling can greatly reduce the search space and improve
speed of mining
 Sampling also introduces bias where not all data have fair
representation in the sample (ex. a biased lottery machine)
◼ Knowledge about data
 Documentation about data must provide sufficient and correct
information about data
 Elementary data analysis (data exploration)
 Knowledge about data may help to improve the data quality



Data Pre-processing
 Processing the data before mining starts
 Purpose: for speedy, cost-effective and high quality
outcomes of data mining
 Pre-processing tasks
◼ Data aggregation
◼ Data sampling
◼ Dimension reduction
◼ Feature selection
◼ Feature creation
◼ Discretization / binarization
◼ Variable transformation
◼ Dealing with missing values



Data Pre-processing: Data Aggregation
 What: to reduce the number of data objects or their attributes
by summarizing low level data details to higher level data
abstraction
 Why: to reduce the time of mining, to rescale data values,
and to discover more stable patterns
 How:
◼ By generalization using a given concept hierarchy
◼ By applying aggregate functions (count, sum, average, etc.)
◼ Dropping some attributes

Original transaction records:

TID     Date         Item      Store        Price   Clubcard#   ……
……      ……           ……        ……           ……      ……          ……
32144   06/06/2006   milk      Buckingham   1.99    1111        ……
11122   04/04/2006   watch     Buckingham   25.99   1011        ……
11122   04/04/2006   battery   Buckingham   3.99    1011        ……
11123   04/04/2006   beer      Buckingham   9.99    1022        ……
22244   04/04/2006   beer      MK           6.99    1022        ……
22244   04/04/2006   nappies   MK           10.89   1022        ……
23311   05/04/2006   beer      MK           6.99    1011        ……
……      ……           ……        ……           ……      ……          ……

Aggregated by date and store:

Date         Store        AveragePrice   ……
……           ……           ……             ……
06/06/2006   Buckingham   1.99           ……
04/04/2006   Buckingham   13.32          ……
04/04/2006   MK           8.94           ……
05/04/2006   MK           6.99           ……
……           ……           ……             ……

Aggregated by Clubcard:

Number of Items   TotalPrice   Clubcard#   ……
……                ……           ……          ……
1                 1.99         1111        ……
3                 36.97        1011        ……
2                 27.87        1022        ……
……                ……           ……          ……
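
The two aggregations on this slide can be sketched in plain Python. This is a minimal illustration, not the slides' own method: the seven rows are the transaction records shown on the slide, while the function names are hypothetical. It assumes that "Number of Items" counts distinct items per Clubcard, which is the reading that matches the slide's figures.

```python
from collections import defaultdict

# (TID, date, item, store, price, clubcard) rows from the slide's transaction table.
transactions = [
    (32144, "06/06/2006", "milk", "Buckingham", 1.99, 1111),
    (11122, "04/04/2006", "watch", "Buckingham", 25.99, 1011),
    (11122, "04/04/2006", "battery", "Buckingham", 3.99, 1011),
    (11123, "04/04/2006", "beer", "Buckingham", 9.99, 1022),
    (22244, "04/04/2006", "beer", "MK", 6.99, 1022),
    (22244, "04/04/2006", "nappies", "MK", 10.89, 1022),
    (23311, "05/04/2006", "beer", "MK", 6.99, 1011),
]


def average_price_by_date_store(rows):
    """Group by (date, store) and apply the average aggregate to the price."""
    groups = defaultdict(list)
    for _tid, date, _item, store, price, _card in rows:
        groups[(date, store)].append(price)
    return {key: round(sum(prices) / len(prices), 2) for key, prices in groups.items()}


def summary_by_clubcard(rows):
    """Group by Clubcard: (number of distinct items, total price)."""
    items = defaultdict(set)
    totals = defaultdict(float)
    for _tid, _date, item, _store, price, card in rows:
        items[card].add(item)
        totals[card] += price
    return {card: (len(items[card]), round(totals[card], 2)) for card in totals}


avg = average_price_by_date_store(transactions)
by_card = summary_by_clubcard(transactions)
print(avg)
print(by_card)
```

Seven detailed records collapse to four date/store rows or three Clubcard rows, at the cost of dropping the TID and Item attributes.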



Data Pre-processing: Data Sampling
 What: selecting a subset of the given data set
 Why: to make it possible to use sophisticated mining algorithms within a time limit.
 Caution: the sample must be representative of the original data set
 How:
◼ Random sampling
 Sampling without replacement
 Sampling with replacement
 Not much difference when the sample size is small compared to the original data set size
◼ Stratified sampling
 Grouping the data in the data set
 Sampling from each group
 Size of each group may be different and may influence the number of samples selected
◼ Progressive sampling
 Start with a small sample and gradually increase its size
 Evaluate its representation of the original data set
 Stop when the representation is close enough
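
Random and stratified sampling can be sketched with the Python standard library. This is a minimal illustration under stated assumptions: the record layout, the 10% sampling fraction and the function names are hypothetical, and progressive sampling is not shown because the slide does not say how representativeness is evaluated.

```python
import random
from collections import defaultdict


def sample_without_replacement(data, n, rng):
    """Each record can be picked at most once."""
    return rng.sample(data, n)


def sample_with_replacement(data, n, rng):
    """The same record may be picked more than once."""
    return [rng.choice(data) for _ in range(n)]


def stratified_sample(data, key, fraction, rng):
    """Group the records by key, then sample the same fraction from each group."""
    groups = defaultdict(list)
    for record in data:
        groups[key(record)].append(record)
    sample = []
    for members in groups.values():
        k = min(len(members), max(1, round(len(members) * fraction)))
        sample.extend(rng.sample(members, k))
    return sample


rng = random.Random(42)  # fixed seed so the sketch is repeatable
records = [("young", i) for i in range(60)] + [("retired", i) for i in range(40)]

simple = sample_without_replacement(records, 10, rng)
bootstrap = sample_with_replacement(records, 10, rng)
strat = stratified_sample(records, key=lambda r: r[0], fraction=0.1, rng=rng)
print(len(simple), len(bootstrap), len(strat))
```

With proportional allocation, a group of 60 contributes 6 records and a group of 40 contributes 4, so each group keeps its share of the sample.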



Data Pre-processing:
Data Dimension Reduction
 The Problem:
◼ Data set may have many dimensions (Ex. a 120x80 pixel image)
◼ Curse of dimensionality
 As dimensionality increases, the data become increasingly sparse
 Data analysis becomes harder because the mined patterns become less significant
and more peculiar.
 The time for data processing increases significantly as dimensionality increases
 Not all attributes are equally significant in describing the pattern
 Why: to reduce the effects of curse of dimensionality
 How:
◼ Linear algebra techniques
 Principal component analysis (PCA) :: A B C -> D E F
 Independent component analysis (ICA) :: A B C -> D E
 Singular value decomposition (SVD) :: matrices
◼ Feature subset selection



Data Pre-processing:
Feature Subset Selection
 What: to reduce dimensionality by selecting a subset of attributes
 Purposes:
◼ To remove or reduce redundant features (ex. derived attributes)
◼ To remove irrelevant features that contain no useful information for the data
mining task (ex. ID attribute)
 How:
◼ Use common sense and domain knowledge for manual selection
◼ Embedded approach: let the mining algorithm select suitable features (ex.
decision tree induction)
◼ Filter and wrapper approaches:

[Figure: flowchart of the filter/wrapper process. A subset is selected from the attributes and evaluated against a stopping criterion; if "Not ok", another subset is selected; if "ok", the selected subset is validated with the mining task.]


Data Pre-processing: Feature Creation
 What: to create a new set of features from the original features
 Purpose: in the new feature space, meaningful patterns can be
extracted more easily. The number of features may be reduced.
 How:
◼ Using feature extraction methods to extract new features from the
existing ones
 Ex. extracting color, texture and shape from image of pixel values
◼ Mapping data to a new space
 Ex. wavelet/Fourier transformation of pixel values of images to a frequency
domain
◼ Constructing new features from the existing ones using domain
knowledge
 Ex. using transaction dates to construct a new feature customer tenure that
indicates the loyalty of the customer
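
The customer tenure example can be sketched as follows. This is a minimal Python illustration, and the definition used here (days from the first recorded transaction to a reference date) is an assumption, since the slide does not define tenure precisely. The function name and dates are hypothetical.

```python
from datetime import date


def customer_tenure_days(transaction_dates, as_of):
    """Constructed feature: days between the first transaction and the as_of date."""
    return (as_of - min(transaction_dates)).days


# Illustrative transaction dates for one customer (assumption).
dates = [date(2006, 4, 4), date(2006, 4, 5), date(2006, 6, 6)]
tenure = customer_tenure_days(dates, as_of=date(2006, 6, 30))
print(tenure)
```

Several raw date values are replaced by one number that a mining algorithm can compare directly across customers.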



Data Pre-processing:
Discretization / Binarization
 What: to convert continuous attribute values to discrete
categorical values, and to convert discrete categorical values to
binary Boolean attribute values
 Why:
◼ Requirement for some data mining solutions
◼ Better data mining results
 How:
◼ Binarization: convert m categorical values to integers in [0, m-1] and then either
 Convert each integer to a binary number of n bits where n = ⌈log2 m⌉, or
 Use m asymmetric binary variables, one for each of the m values
◼ Discretization process:
 Deciding how many categories to have and where split points should
be
 Mapping values to categories: sort the values, split them into sub-
ranges and map all the values in a sub-range to the same category
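
The two binarization options can be sketched in Python. This is a minimal illustration with hypothetical function names. Categories are numbered in sorted order, which is an arbitrary choice; any one-to-one numbering would do.

```python
def binary_code(values):
    """Map each of m categories to an n-bit binary code, n = ceil(log2 m)."""
    categories = sorted(set(values))
    # (m - 1).bit_length() equals ceil(log2 m) for m >= 2; use at least one bit.
    n_bits = max(1, (len(categories) - 1).bit_length())
    return {c: [int(bit) for bit in format(i, f"0{n_bits}b")]
            for i, c in enumerate(categories)}


def one_per_value(values):
    """Map each of m categories to m binary variables, exactly one of which is 1."""
    categories = sorted(set(values))
    return {c: [1 if c == other else 0 for other in categories] for c in categories}


grades = ["F", "D", "C", "B", "A"]
compact = binary_code(grades)
wide = one_per_value(grades)
print(compact)
print(wide)
```

Five categories need 3 bits in the compact coding and 5 variables in the one-per-value coding. The compact form saves space but creates bit patterns shared between unrelated categories, which is why association analysis usually prefers one variable per value.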



Data Pre-processing: Discretization
 How:
◼ Unsupervised methods: discretization
without regard to the outcome of any one
specific attribute, normally used for clustering
and association rule discovery
Original Data Set: {11, 11, 12, 17, 18, 18, 19, 19, 20, 22, 24, 24, 25, 26}
Equal width method: {C, C, C, T, T, T, T, T, T, Y, Y, Y, Y, A}
Equal frequency method: {C, C, C, C, T, T, T, T, T, Y, Y, Y, Y, Y}
Clustering method: {C, C, C, T, T, T, T, T, T, Y, Y, Y, Y, Y}
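
The equal width and equal frequency methods can be sketched as below. This is a minimal Python illustration with hypothetical function names. It assumes k = 3 bins and returns bin indices rather than the letter labels, and because the slide does not state its bin boundaries, the assignments may differ from the rows above.

```python
def equal_width_bins(values, k):
    """Split the range [min, max] into k sub-ranges of equal width."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / k
    # The maximum value would compute as bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_frequency_bins(values, k):
    """Sort the values and give each bin (as nearly as possible) the same count."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins


ages = [11, 11, 12, 17, 18, 18, 19, 19, 20, 22, 24, 24, 25, 26]
width_bins = equal_width_bins(ages, 3)
freq_bins = equal_frequency_bins(ages, 3)
print(width_bins)
print(freq_bins)
```

Note that this simple equal frequency version can place tied values in different bins (here the two 18s), which a production implementation would normally avoid.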
Data Pre-processing: Discretization
 How:
◼ Supervised methods, discretization of attribute values with respect to the
outcome of the class attribute, normally used for classification
◼ Simple methods, sorting according to the class attribute, and then
discretizing the attribute values for each class. The problem is that the
attribute values may not be nicely distributed according to the class
values.
◼ Sophisticated methods, the discretization of the attribute values purifies
the outcome of the class. For instance, we can use entropy to measure
the degree of purity, and decide the split points recursively, similar to
decision tree induction (See Decision Tree Induction for detail).
◼ Merging methods, merging small intervals into a larger one with a stop
criterion (See association rule part for detail).
◼ Supervised discretization of two attributes
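
The entropy-based idea can be sketched as a single split-point search. This is a minimal Python illustration, not the full recursive procedure: it finds one split that minimizes the weighted class entropy, and a complete method would repeat it on each side until a stopping criterion is met. The function names and the small data set are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Class entropy in bits; 0 means the set is pure."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def best_split(values, labels):
    """Return (split point, weighted entropy) of the purest two-way split."""
    pairs = sorted(zip(values, labels))
    best_point, best_entropy = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between two equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if weighted < best_entropy:
            best_point = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_entropy = weighted
    return best_point, best_entropy


# Illustrative data (assumption): attribute values with their class labels.
values = [11, 12, 17, 18, 24, 25]
labels = ["risky", "risky", "risky", "safe", "safe", "safe"]
point, purity = best_split(values, labels)
print(point, purity)
```

Here the split at 17.5 separates the two classes completely, so the weighted entropy is zero.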



Data Pre-processing:
Variable Transformation
 What: transform all values of an attribute to
other values
 Why:
◼ Making the attribute values more sensible for
mining
◼ Removing the effect of the outlier values
◼ Making the result data visualization more
interpretable
◼ Making the values more comparable
 How:
◼ Transformation using function
 Ex. x^k, log(x), sin(x), etc.
◼ Standardization and/or normalization
 Ex. z-score, division-by-range, etc.
 Caution: transformation has to be done with
care.

[Figure: histogram of count# against Call time (sec), with intervals running from [1, 10] to [401, 500]]
[Figure: histogram of Count# against the logarithm (base 10) of Call Time, with intervals [0, 1], [1, 2], [2, 3]]
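
The transformations named on this slide can be sketched in Python. This is a minimal illustration with hypothetical function names and made-up call times. "Division-by-range" is read here as min-max scaling, (x - min) / (max - min), which is an assumption about what the slide intends.

```python
import math
import statistics


def z_score(values):
    """Standardize to mean 0 and sample standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # needs at least two values that are not all equal
    return [(v - mean) / sd for v in values]


def min_max(values):
    """Rescale to [0, 1] by subtracting the minimum and dividing by the range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def log10_transform(values):
    """Compress a long right tail; only valid for strictly positive values."""
    return [math.log10(v) for v in values]


# Illustrative call times in seconds (assumption).
call_times = [1, 10, 100, 1000]
standardized = z_score(call_times)
scaled = min_max(call_times)
logged = log10_transform(call_times)
print(standardized, scaled, logged)
```

The log transform turns values spread over three orders of magnitude into evenly spaced ones, which matches the slide's caution: the transformed values answer a different question from the raw ones, and zero or negative inputs cannot be logged at all.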



Data Pre-processing:
Handling Missing Values
 What: to treat attributes with null values
 Why:
◼ Improve data quality
◼ Better mining results
 How:
◼ Eliminating data objects or attributes with missing values (may not always be
possible)
◼ Ignoring the missing values
◼ Using sensible default values
 Ex. NumberOfTransactions is set to 0, if it is unknown
◼ Data imputation methods
 Average, median, or mode
 Average, median or mode of the nearest neighbors
◼ Postponing the handling to the data mining methods, making the mining methods
adaptive to missing values (see clustering and classification parts)
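
Elimination and simple imputation can be sketched as follows. This is a minimal Python illustration with hypothetical names. It assumes a missing value is represented by None, and it shows only the global mean, median and mode, not the nearest-neighbour variant.

```python
import statistics

MISSING = None  # assumed marker for a null value


def drop_missing(records):
    """Eliminate every data object that has at least one missing value."""
    return [r for r in records if all(v is not MISSING for v in r)]


def impute(values, strategy="mean"):
    """Replace missing values with the mean, median or mode of the observed ones."""
    observed = [v for v in values if v is not MISSING]
    if not observed:
        raise ValueError("no observed values to impute from")
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "mode":
        fill = statistics.mode(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is MISSING else v for v in values]


ages = [23, MISSING, 31, 23, 40]
by_mean = impute(ages, "mean")
by_median = impute(ages, "median")
by_mode = impute(ages, "mode")
complete = drop_missing([(1, "a"), (2, MISSING), (3, "c")])
print(by_mean, by_median, by_mode, complete)
```

Elimination loses whole records, while imputation keeps them but shrinks the apparent variance of the attribute, so the choice depends on how many values are missing and on the mining task.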



Data Exploration
 Knowing your data is essential for the success of data mining
 “Turn your machine loose and see what it can find” is a myth about data
mining
 Purposes:
◼ Better understanding of the characteristics of data
◼ Better decision over data pre-processing tasks
◼ Even being able to discover some hidden patterns
 Categories of data exploration techniques
◼ Summary statistics: using a small set of descriptors to describe the characteristics
of a large data set
◼ Data visualization: using graphical or tabular forms to reveal data patterns
◼ Online Analytic Processing (OLAP)
 There are some overlaps between Data Exploration and Exploratory Data
Analysis (EDA)



Data Summarization
 Frequency and Mode (often for categorical attributes):
◼ Frequency of value v: the number of data objects with v divided by the
total number of data objects
◼ Mode: the most frequently occurring value
◼ Ex. {11, 11, 12, 17, 18, 18, 19, 19, 20, 22, 24, 24, 25, 26}
Frequency of 11 is 2/14
Mode of the data set is 11, 18, 19 or 24
 Percentiles (for ordinal or continuous attributes):
◼ Given an attribute x and an integer p (0≤p≤100), the percentile x_p is a
value of x such that p% of the observed values of x are less than x_p.
◼ Ex. Age {11, 11, 12, 17, 18, 18, 19, 19, 20, 22, 24, 24, 25, 26}
Percentile Age50 = 19
Percentile Age70 = 22
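
The frequency, mode and percentile figures above can be reproduced with a few lines of Python. This is a minimal sketch with hypothetical function names. The percentile uses the nearest-rank rule (the value at position ⌈p·m/100⌉ in sorted order), which is an assumption, chosen because it yields the slide's values; other percentile definitions interpolate and can give slightly different numbers.

```python
from collections import Counter


def frequency(values, v):
    """Number of data objects with value v divided by the total number."""
    return Counter(values)[v] / len(values)


def modes(values):
    """All values that occur most frequently (there may be several)."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


def percentile(values, p):
    """Nearest-rank percentile for an integer p between 0 and 100."""
    ordered = sorted(values)
    rank = max(1, -(-p * len(ordered) // 100))  # integer ceiling of p * m / 100
    return ordered[rank - 1]


ages = [11, 11, 12, 17, 18, 18, 19, 19, 20, 22, 24, 24, 25, 26]
print(frequency(ages, 11), modes(ages), percentile(ages, 50), percentile(ages, 70))
```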



Data Summarization
 Mean and Median (for continuous attributes):
◼ Mean: the sum of the values of the attribute divided by the
number of values
◼ Median: sort the values of attribute, and the middle value or the
average of the middle two values is the median.
◼ Median is a better indication of “average” when data distribution
is skewed or outliers are present
◼ Ex. Age {11, 11, 12, 17, 18, 18, 19, 19, 20, 22, 24, 24, 25, 26}
Mean of Age = ∑Age/Count(Age) = 266/14 = 19
Median of Age = Avg(19,19) = 19
 Trimmed mean and median: computed after trimming p% of the values
in total, p/2% from the top and p/2% from the bottom
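
The mean, median and trimmed mean can be compared on the Age data from this slide. This is a minimal Python sketch. The `trimmed` helper and the added outlier of 130 are illustrative assumptions, and the trimming rounds the number of dropped values down.

```python
import statistics


def trimmed(values, p):
    """Drop p% of the values in total: p/2% from the bottom and p/2% from the top."""
    ordered = sorted(values)
    k = len(ordered) * p // 200  # values to drop at each end, rounded down
    return ordered[k:len(ordered) - k]


ages = [11, 11, 12, 17, 18, 18, 19, 19, 20, 22, 24, 24, 25, 26]
mean_age = statistics.mean(ages)
median_age = statistics.median(ages)

# Add one extreme value to see how each measure reacts.
with_outlier = ages + [130]
mean_out = statistics.mean(with_outlier)
median_out = statistics.median(with_outlier)
trimmed_mean_out = statistics.mean(trimmed(with_outlier, 20))
print(mean_age, median_age, mean_out, median_out, trimmed_mean_out)
```

A single outlier moves the mean from 19 to 26.4, while the median stays at 19 and the 20% trimmed mean stays close to it, which is the sense in which the median is the better indication of "average" for skewed data.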



Data Summarization
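 Measures of Spread:
◼ Range: $\mathrm{range}(x) = \max(x) - \min(x)$
◼ Variance ($\sigma^2$): $\sigma^2 = \frac{1}{m-1} \sum_{i=1}^{m} (x_i - \bar{x})^2$
◼ Standard Deviation ($\sigma$): $\sigma = \sqrt{\frac{1}{m-1} \sum_{i=1}^{m} (x_i - \bar{x})^2}$
◼ Absolute average deviation (AAD): $\mathrm{AAD}(x) = \frac{1}{m} \sum_{i=1}^{m} |x_i - \bar{x}|$
 Multivariate Summary Statistics
◼ Mean vector: $\bar{\mathbf{x}} = (\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_n)$
◼ Matrix of covariance: $\mathrm{covariance}(x, y) = \frac{1}{m-1} \sum_{i=1}^{m} (x_i - \bar{x})(y_i - \bar{y})$
◼ Correlation: $\mathrm{correlation}(x, y) = \frac{\mathrm{covariance}(x, y)}{\sigma_x \sigma_y}$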
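
The spread and multivariate measures on this slide can be written directly from their definitions. This is a minimal Python sketch with hypothetical function names and made-up data. It follows the slide in dividing by m − 1 for the variance and covariance and by m for the AAD.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def value_range(xs):
    return max(xs) - min(xs)


def variance(xs):
    """Sample variance: squared deviations from the mean, divided by m - 1."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def std_dev(xs):
    return math.sqrt(variance(xs))


def aad(xs):
    """Absolute average deviation: mean absolute distance from the mean."""
    m = mean(xs)
    return sum(abs(x - m) for x in xs) / len(xs)


def covariance(xs, ys):
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def correlation(xs, ys):
    return covariance(xs, ys) / (std_dev(xs) * std_dev(ys))


# Illustrative data (assumption): y is exactly twice x.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
print(value_range(x), variance(x), std_dev(x), aad(x), covariance(x, y), correlation(x, y))
```

Covariance depends on the units of both attributes, whereas correlation is scaled into [-1, 1]; for the perfectly linear pair above it is 1.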



Data Visualization: Overview
 Rationale: human eyes are good at spotting patterns, particularly
visual patterns.
 Major ways of visualizing data
◼ Tabular form
◼ Graphical form
◼ Points and links representing objects and relationships
 Visual representation of attributes is related to the data type of
the attributes
 Visualizing data as well as all its implicit relationships can be a
challenge.
 The visualization must have obvious affordance
 The visualization of data must tell the truth



Data Visualization: Methods
 Power of Arrangement
◼ Arrange rows and columns of a table
◼ Arrange graph components
◼ Sorting

Original table:

    1  2  3  4  5  6
1   0  1  0  1  1  0
2   1  0  1  0  0  1
3   0  1  0  1  1  0
4   1  0  1  0  0  1
5   0  1  0  1  1  0
6   1  0  1  0  0  1
7   0  1  0  1  1  0
8   1  0  1  0  0  1
9   0  1  0  1  1  0

After rearranging rows and columns:

    6  1  3  2  5  4
4   1  1  1  0  0  0
2   1  1  1  0  0  0
6   1  1  1  0  0  0
8   1  1  1  0  0  0
5   0  0  0  1  1  1
3   0  0  0  1  1  1
9   0  0  0  1  1  1
1   0  0  0  1  1  1
7   0  0  0  1  1  1
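
The effect of rearrangement can be reproduced by sorting. This is a minimal Python sketch, not the slides' own procedure. It sorts the columns by their contents and then the rows, which groups identical rows and columns together. The function name is hypothetical, and the resulting order of the blocks differs from the slide's, although the block structure is the same.

```python
def rearrange(matrix, row_labels, col_labels):
    """Sort columns, then rows, in descending order of their contents."""
    col_order = sorted(range(len(col_labels)),
                       key=lambda j: tuple(row[j] for row in matrix),
                       reverse=True)
    reordered = [[row[j] for j in col_order] for row in matrix]
    row_order = sorted(range(len(matrix)), key=lambda i: reordered[i], reverse=True)
    return ([row_labels[i] for i in row_order],
            [col_labels[j] for j in col_order],
            [reordered[i] for i in row_order])


# The 9 x 6 table from the slide: odd and even rows alternate.
odd_row = [0, 1, 0, 1, 1, 0]
even_row = [1, 0, 1, 0, 0, 1]
matrix = [odd_row if r % 2 == 1 else even_row for r in range(1, 10)]

rows, cols, blocks = rearrange(matrix, list(range(1, 10)), list(range(1, 7)))
print(cols)
for label, row in zip(rows, blocks):
    print(label, row)
```

After sorting, the alternating pattern becomes two solid blocks: rows 1, 3, 5, 7, 9 share columns 2, 4, 5, and rows 2, 4, 6, 8 share columns 1, 3, 6.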



Data Visualization: Methods
 Pie Chart for Categorical
Attribute
 Histogram/Bar chart
 Stem and Leaf plots for a
single attribute
 Scatter plots, Contour plots
and surface plots
 Data matrix, Parallel dim.,
star dim.

[Figure: scatter plot of Body Weight against Age, with points grouped as Child, Teen and Adult]


Pattern Visualization:
Types of Patterns
Data table:

outlook    temperature   humidity   windy   Class
sunny      85            high       FALSE   negative
sunny      80            high       TRUE    negative
overcast   83            high       FALSE   positive
rain       70            high       FALSE   positive
rain       68            normal     FALSE   positive
rain       64            normal     TRUE    negative
overcast   64            normal     TRUE    positive
sunny      72            high       FALSE   negative
sunny      69            normal     FALSE   positive
rain       75            normal     FALSE   positive
sunny      75            normal     TRUE    positive
overcast   72            high       TRUE    positive
overcast   81            normal     FALSE   positive
rain       71            high       TRUE    negative

Patterns in tabular form:

outlook    humidity   windy   Class
overcast   any        any     positive
sunny      high       any     negative
sunny      normal     any     positive
rain       any        TRUE    negative
rain       any        FALSE   positive

Data table:

Name       Category   Subject     City         GPA
Anderson   M.A.       History     London       3.5
Bach       2nd        Math        Buck         3.6
Fraser     M.Sc.      Physics     Brighton     3.4
Patel      Ph.D.      Computing   Bombay       3.5
Hart       3rd        Math        London       2.7
Jackson    3rd        Computing   Colchester   2.5
Liu        M.Sc.      Computing   Beijing      3.5
Meyer      Ph.D.      Biology     Berlin       3.2
Longaney   M.A.       History     Delhi        3.5
Xia        Ph.D.      Biology     Shanghai     3.5

Patterns in tabular form:

Subject   City      GPA
Art       Any       Excellent
Science   Any       Good
Science   Foreign   Excellent


Pattern Visualization:
Types of Patterns
[Figure: patterns shown as an artificial neural network and as a decision tree]


Pattern Visualization:
Types of Patterns
 Rules of various kinds
◼ Classification rules
 Ex. IF A = a1 and B = b1 … THEN Class = c1
◼ Association rules
 Ex. IF jeans and nappies, THEN beer
 Ex. IF food, THEN papers and magazines
◼ Rules with exceptions
 Ex. IF income = middle and hasCar THEN good customer
except IF age = teen THEN risky customer
◼ Rules involving relations
 Ex. IF width > height THEN things
 Ex. IF height > width THEN human
Pattern Visualization:
Types of Patterns
 Clusters

[Figure: data points (a, b, c, d, e, f, g, h, i, k) grouped into clusters]

Degree of membership of each point in each cluster:

    Cluster 1   Cluster 2   Cluster 3
a   0,4         0,1         0,5
b   0,1         0,8         0,1
c   0,3         0,3         0,4
d   0,1         0,1         0,8
e   0,4         0,2         0,4
f   0,7         0,1         0,2
g   0,5         0,4         0,1


Data Exploration, OLAP
 Online Analytic Processing (OLAP)
◼ Interactive reporting with visualization
◼ Handles non-trivial queries
◼ Fast operation and fast delivery of result
◼ Directly supports business operations
 Typical OLAP query:
◼ For each product, find its market share in its category
today minus its market share in its category in 1994

Products        Market Share Today   Market Share in 1994   Difference
Dell 17"        17%                  10%                    7%
HP 15"          83%                  90%                    -7%
Intel MotherB   56%                  93%                    -37%
…………            ….                   ….                     …….


Data Exploration, OLAP
 Online Analytic Processing (OLAP)
◼ Treating a data set as a multidimensional
cube
◼ Operations over the cube for data
summarization and data drilling



Data Exploration, OLAP
Branch Name     Customer Name   Month   Year
Buckingham      Helen Miles     April   2000
Buckingham      Mary Laughton   April   1999
……              ……              ….      ….
Milton Keynes   Alen Young      Feb     2000
Milton Keynes   Susan Young     April   2000
……              ……              ….      ….
Northampton     Frank Sinatra   April   1998
…………            ….              ….      ….

[Figure: data cubes over the dimensions branch (Buckingham, Milton Keynes, Northampton), year (1998, 1999, 2000) and time (months Jan to Dec, or seasons winter, spring, summer, autumn); a cell holds aggregates such as Total Customer = 5 and the customer names]


Summary
 To conduct data mining effectively, one must know the data set.
 Data types have a significant effect on the meaningfulness of the patterns. Inappropriate
operations applied to the wrong types of data make the patterns meaningless.
 Data quality is an important but relative issue. “Garbage-in, garbage-out”! However,
making mining solutions “garbage” tolerant is equally important.
 Various data pre-processing tasks may need to be conducted before mining starts. This
stage normally takes most of the time.
 Which data pre-processing needs to be performed depends on the intended application
and intended data mining tasks.
 Data exploration helps data understanding, and hence is the first task before serious
mining.
 Using data summary, visualisation and even OLAP operations, we can gain insight into
the nature of the data, which helps in forming valid and sensible mining operations.



Further Reading
 Hongbo Du, “Data Mining Techniques and
Applications”, Chapter 3
 Witten, I. & Frank, E. “Data Mining: Practical
Machine Learning Tools and Techniques
with Java Implementations”, Chapter 2,
Morgan Kaufmann, 2000.
 Tan, P., Steinbach, M. & Kumar, V.
“Introduction to Data Mining”, Chapter 2 &
3, Addison-Wesley, 2006