Data Mining Techniques & Applications

This document provides an overview of data mining techniques and applications, including different types of data sets. It discusses how a data set typically consists of a collection of data objects or instances that are characterized by attributes and measurements. Common types of data sets include tables of records, transaction databases, data matrices, graph-based data, ordered/sequential data, and spatial data. Understanding the different data types is important for correctly applying data mining techniques and exploring and summarizing data.

Uploaded by

AzerMušinović

Data Mining

Techniques & Applications

Processing & Data Exploration

Topics
 Data and Data sets
 Data Types
 Data Sources and Data Quality
 Data Pre-processing Methods
 Data Summarization
 Data Visualization
 Pattern Visualization
 Overview of OLAP
Sarajevo School of Science and Technology 2
Data
 Instance/Example/Object
◼ The input data set is normally considered as a collection of data objects or
instances or examples.
◼ Each instance is an individual and independent record of real life events.
 Attribute and Measurement
◼ A data instance is characterized by its values on a fixed, pre-defined set of features
or attributes.
◼ An attribute describes a specific property or characteristic of a data object.
◼ Measurement is a process of assigning a valid value to an attribute according to a
measurement scale.
◼ Use of appropriate measurement scales is extremely important for the correct
understanding of the attribute.
 Ex. <1002, “J.Smith”, 23, “Milton Keynes”, £230k, 20/2/95, on time>


Types of Data Sets
 Data set of records
◼ Table of records (single table or joining of multiple tables)

  Age Group    Own Car  Income Band  Class
  young        yes      low          risky
  young        no       low          risky
  middle aged  yes      middle       risky
  middle aged  no       high         safe
  middle aged  yes      low          risky
  young        yes      high         risky
  middle aged  no       low          safe
  retired      yes      middle       safe
  retired      no       middle       safe
  retired      yes      high         safe

◼ Transaction database (transaction ID and set of items)

  TID  Items
  100  apple, beer, newspaper
  200  apple, beef, beer, newspaper, potato
  300  beef, potato
  400  beef, noodles
  500  beef, potato

◼ Data matrix (table with numeric attributes)

  Object Count  Diameter  Area    Shape Factor  Intensity  x       y
  34            15,19     181,14  0,99          655,58     249,74  242,32
  41            16,07     202,83  0,9           953,22     241,63  244,51
  59            19,98     313,46  0,88          688,95     239,67  248,89
  44            18,14     258,49  0,98          786,62     225,83  243,87
  30            17,42     238,31  0,96          501,1      405,88  66,2
  30            15,59     190,93  1,01          647,18     237,58  247,98
  32            15,16     180,49  0,99          624,84     233,26  263,12

◼ Document-term matrix

  DocumentID  match  coach  game  play  win
  1           10     2      2     2     3
  2           2      1      0     0     2
  3           0      34     5     10    10
  4           4      0      1     2     2


Types of Data Sets
 Graph-based data
◼ Using a graph to represent the relationships between data objects
◼ Data object structure is represented as a graph


Types of Data Sets
 Ordered data
◼ Sequential (temporal) data: i.e. record + time tag

  CID  Items
  1    (t1: apple, milk), (t4: newspaper)
  2    (t1: apple, beef), (t3: milk, newspaper)
  3    (t2: beef, potato)
  4    (t4: beef, noodles)

◼ Sequence data: an ordered sequence of entities without time tag

  GGTTCCGCCTTCAGCCC
  CGCGCCCGCAGGG…

◼ Time series data: each record is a time series of measurements

  Foci Progress
  Region  measured at 0.5 h  measured at 1 h  measured at 2 h  measured at 6 h
  1       72.30              33.98            30.7             10.2
  2       65                 32.5             26.4             12.5
  3       67.8               34.3             22.1             8.4

◼ Spatial data


Types of Data Sets
 Most used form of data set is record set
 A record may directly capture the raw record data or the
extracted features of non-record raw data
 Ex. Instead of the following skin image, we measure the
number of cells, the thickness of the dermis layer, the
thickness of fat layer, and then label the record as
“diabetic”.

[Figure: skin image with the fat layer, the dermis layer and the cells labelled]


General Characteristics of Data Sets
 Size
◼ In terms of the total number of records
◼ Small (MB), medium (GB) and large (TB)
 Dimensionality
◼ Varies from data set to data set, from low to extremely high
◼ Curse of dimensionality
 Sparsity
◼ Values are skewed to some extreme or sub-ranges
◼ Asymmetric (some values are more important than others)
◼ May be useful for speedy processing
 Resolution
◼ Right level of data details
◼ Related to the intended purpose
◼ Not necessarily “the more details the better”


Data Types
 Ideally, the domain features must reflect the
properties of the attribute.
◼ Ex. attribute Age is an integer between 0 and 130
 However, this is not always observed.
◼ Ex. EmpID attribute is usually declared as integer type. Integers 1001 and 1020 represent two different employees. However, we know that adding the two numbers or taking their average would not make sense.


Data Types
 Different types:
◼ Categorical/Qualitative types (Nominal,
Ordinal)
◼ Numeric/Quantitative types (Interval, Ratio)
◼ Discrete vs. continuous attributes
◼ Domain nature, applicable operations, and
transformation


Data Types: Nominal
 A set of names, no concept of order and difference
◼ Ex. attribute Name: {“Jones”, “Smith”, “Wang”, “Richardson”}
 Only enough information to distinguish one object from
another
 Operators applicable: =, ≠
◼ Ex. SELECT * FROM Customers WHERE name = 'Richardson'
 Caution: nominal attributes declared as numbers should not be treated as numbers. Ex. EmpID.
 A one-to-one transformation is permissible
◼ Ex. Gender:{“Female”, “Male”} => {F, M} => {0, 1}
◼ Ex. EmpID :[1, 10] => EmpID:{A, B, C, D, E, F, G, H, X, S}


Data Types: Ordinal
 A set of names with concept of order, without concept of
difference
◼ Ex. Temperature: {cold, warm, hot}
◼ Ex. Grades: {F, D, C, B, A}
 Operators applicable: =, ≠, <, >, ≤, ≥
◼ Ex. cold < warm < hot
 We only know there is a difference between two values,
but we do not know by how much
◼ Ex. Hot – warm = warm – cold? Hot = 2 * warm?
 An order-preserving transformation of values is permitted
◼ Ex. Customer status: {bad, average, good, excellent} =>
Customer status: (1, 2, 3, 4)


Data Types: Interval
 A set of numeric values with concepts of order and difference
◼ Ex. Calendar year: 1945, 1970, 2006
◼ Ex. Temperature scales °F and °C
 Operators applicable: all operators for the ordinal type plus +, -
◼ Ex. Calendar year: 1945 < 1970, 2007 – 1959 = 48
 There is no true (absolute) zero on the scale, so operations such as * and / are not meaningful
◼ Ex. Is 20°C twice as warm as 10°C?
◼ Ex. Does 2007 * 2 really make sense?
 The transformation of new_value = a * old_value + b where a
and b are constants is permitted
◼ Ex. Temperature: °C = (5/9) * (°F – 32)
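
The permitted transformation for interval data can be checked numerically. The following is a minimal sketch (the function name `c_to_f` and the sample temperatures are illustrative, not from the slides): a linear map new_value = a * old_value + b keeps ratios of differences intact but changes ratios of values, which is why "20°C is twice as warm as 10°C" has no meaning on an interval scale.

```python
def c_to_f(celsius: float) -> float:
    """Interval-scale transformation new = a * old + b with a = 9/5, b = 32."""
    return celsius * 9 / 5 + 32


c_low, c_mid, c_high = 10.0, 20.0, 40.0
f_low, f_mid, f_high = (c_to_f(c) for c in (c_low, c_mid, c_high))

# Ratios of values depend on the unit, so "*" and "/" are not meaningful.
ratio_c = c_mid / c_low   # 20 / 10
ratio_f = f_mid / f_low   # 68 / 50

# Ratios of differences survive the transformation, so "+" and "-" are meaningful.
diff_ratio_c = (c_high - c_mid) / (c_mid - c_low)
diff_ratio_f = (f_high - f_mid) / (f_mid - f_low)

print(ratio_c, ratio_f)            # the two ratios disagree
print(diff_ratio_c, diff_ratio_f)  # the two ratios agree
```

For a ratio attribute the offset b must be zero (new_value = a * old_value), and then ratios of values are preserved as well.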


Data Types: Ratio
 A set of values with concepts of order, difference and ratio
 The set has an absolute zero
 Operators applicable: all operators for the ordinal type plus +, -, *, /
 The transformation of new_value = a * old_value where a is a constant is permitted. Ex. Converting meters to feet.
 Examples:
◼ The difference between my age (30) and my niece’s age (10) is 20 years. My age is 3 times my niece’s.
◼ If the distance between Tuzla and Sarajevo is 130 km, and the distance between Mostar and Tuzla is 260 km, then the distance between MO and TZ is twice the distance between TZ and SA.


Data Source
 Operational databases (primarily for
tactical decision making)
◼ Single table
◼ Inner/Outer joins of a number of related
tables
 External sources
◼ From partners or third parties
◼ Combined with internal data source
Data Source
 Data warehouse
◼ An organizational database for decision making
◼ A central data repository, separate from the operational systems, that uploads and integrates data from the operational systems
◼ Organization wide data consistency and data
integration
◼ Data details as well as data summarization
◼ Equipped with data analysis and reporting tools
◼ Refer to later parts of this course
Data Quality
 Data quality is an important issue for information-based decision making, but it is largely ignored by many organisations
 “Garbage-in, garbage-out”
 Evaluating data quality
◼ Accuracy
◼ Correctness
◼ Completeness
◼ Consistency
◼ Redundancy
 For data mining, “addressing quality issue at source” cannot always be expected, so quality issues are handled
◼ By data cleaning
◼ By using tolerant mining algorithms
 Quality is relevant to the intended purpose of data mining


Data Source and Data Quality
 Measurement and Data Collection Issues w.r.t Quality
◼ Noise
 Random measurement error, or the addition of spurious objects
 Often associated with data with spatial and temporal properties
◼ Precision, bias and accuracy of measurements (when a number
of measurements are repeated)
 Precision: the closeness of the measurements to one another,
measured by the standard deviation of the measurements
 Bias: a systematic variation of measurements from the quantity
being measured, measured against the known external value
 Accuracy: the closeness of the measure to the true value, normally
indicated by the number of significant digits of the measurements
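
Precision and bias as defined above can be computed directly from repeated measurements. The sketch below is illustrative only: the function name `precision_and_bias` and the five weighings of a 1 g reference mass are hypothetical, not taken from the slides.

```python
import statistics


def precision_and_bias(measurements, true_value):
    """Return (precision, bias) for repeated measurements of a known quantity."""
    # Precision: closeness of the measurements to one another (standard deviation).
    precision = statistics.stdev(measurements)
    # Bias: systematic offset of the measurements from the known external value.
    bias = statistics.mean(measurements) - true_value
    return precision, bias


# Hypothetical data: five weighings of a 1 g laboratory reference mass.
weighings = [1.015, 0.990, 1.013, 1.001, 0.986]
precision, bias = precision_and_bias(weighings, true_value=1.0)
print(f"precision = {precision:.3f} g, bias = {bias:+.3f} g")
```

A small standard deviation with a large bias indicates measurements that are consistent with each other but systematically off target.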


Data Source and Data Quality
 Measurement and Data Collection Issues
w.r.t Quality
◼ Outliers
 Different from most values in the data set or
 Unusual with respect to the typical values
◼ Missing values (Null value)
 Not measured or Not available
◼ Inconsistency (within data, between data and
overall)
Data Source and Data Quality
 Application Issues w.r.t Quality
◼ Timeliness
 Data have a limited period of time of validity
 Patterns drawn from old data may not be applicable to
the current situation
◼ Relevance
 Selection of attributes for data mining is related to the
purpose of mining
 Mined patterns may miss important attributes if the selection is not done carefully
 Not all attributes are relevant to the mining task


Data Source and Data Quality
 Application Issues w.r.t Quality
◼ Sampling bias
 Sampling can greatly reduce the search space and improve
speed of mining
 Sampling also introduces bias where not all data have fair
representation in the sample (ex. a biased lottery machine)
◼ Knowledge on data
 Documentation about data must provide sufficient correct
information about data
 Elementary data analysis (data exploration)
 Knowledge on data may help to improve the data quality


Data Pre-processing
 Processing the data before mining starts
 Purpose: for speedy, cost-effective and high quality
outcomes of data mining
 Pre-processing tasks
◼ Data aggregation
◼ Data sampling
◼ Dimension reduction
◼ Feature selection
◼ Feature creation
◼ Discretization / binarization
◼ Variable transformation
◼ Dealing with missing values


Data Pre-processing: Data Aggregation
 What: to reduce the number of data objects or their attributes by summarizing low level data details to higher level data abstraction
 Why: to reduce the time of mining, to rescale data values, and to discover more stable patterns
 How:
◼ By generalization using a given concept hierarchy
◼ By applying aggregate functions (count, sum, average, etc.)
◼ Dropping some attributes

  TID    Date        Item     Store       Price  Clubcard#  ……
  ……     ……          ……       ……          ……     ……         ……
  32144  06/06/2006  milk     Buckingham  1.99   1111       ……
  11122  04/04/2006  watch    Buckingham  25.99  1011       ……
  11122  04/04/2006  battery  Buckingham  3.99   1011       ……
  11123  04/04/2006  beer     Buckingham  9.99   1022       ……
  22244  04/04/2006  beer     MK          6.99   1022       ……
  22244  04/04/2006  nappies  MK          10.89  1022       ……
  23311  05/04/2006  beer     MK          6.99   1011       ……
  ……     ……          ……       ……          ……     ……         ……

  Date        Store       AveragePrice  ……
  ……          ……          ……            ……
  06/06/2006  Buckingham  1.99          ……
  04/04/2006  Buckingham  13.32         ……
  04/04/2006  MK          8.94          ……
  05/04/2006  MK          6.99          ……
  ……          ……          ……            ……

  Number of Items  TotalPrice  Clubcard#  ……
  ……               ……          ……         ……
  1                1.99        1111       ……
  3                36.97       1011       ……
  2                27.87       1022       ……
  ……               ……          ……         ……
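
The aggregation on this slide can be sketched in a few lines. This is an illustration rather than the course's implementation: the function names are hypothetical, and only the seven transaction rows visible on the slide are used (the elided "……" rows are left out).

```python
from collections import defaultdict

# (TID, date, item, store, price, clubcard) rows visible in the transaction table.
transactions = [
    (32144, "06/06/2006", "milk",    "Buckingham",  1.99, 1111),
    (11122, "04/04/2006", "watch",   "Buckingham", 25.99, 1011),
    (11122, "04/04/2006", "battery", "Buckingham",  3.99, 1011),
    (11123, "04/04/2006", "beer",    "Buckingham",  9.99, 1022),
    (22244, "04/04/2006", "beer",    "MK",          6.99, 1022),
    (22244, "04/04/2006", "nappies", "MK",         10.89, 1022),
    (23311, "05/04/2006", "beer",    "MK",          6.99, 1011),
]


def average_price_by_date_store(rows):
    """Group by (date, store), drop the other attributes, average the price."""
    groups = defaultdict(list)
    for _tid, date, _item, store, price, _card in rows:
        groups[(date, store)].append(price)
    return {key: round(sum(p) / len(p), 2) for key, p in groups.items()}


def total_price_by_clubcard(rows):
    """Group by club card number and sum the price."""
    totals = defaultdict(float)
    for _tid, _date, _item, _store, price, card in rows:
        totals[card] += price
    return {card: round(total, 2) for card, total in totals.items()}


averages = average_price_by_date_store(transactions)
totals = total_price_by_clubcard(transactions)
for key, value in averages.items():
    print(key, value)
```

Seven low-level records collapse to four (date, store) records, which is the reduction in data objects that the "What" bullet describes.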


Data Pre-processing: Data Sampling
 What: selecting a subset of the given data set
 Why: to make it possible to use sophisticated mining algorithms within a time limit.
 Caution: the sample must be representative of the original data set
 How:
◼ Random sampling
 Sampling without replacement
 Sampling with replacement
 Not much difference when the sample size is small compared with the original data set size
◼ Stratified sampling
 Grouping the data in the data set
 Sampling from each group
 Size of each group may be different and may influence the number of samples selected
◼ Progressive sampling
 Starting with a small sample and gradually increase its size
 Evaluate its representation of the original data set
 Stop when the representation is close enough
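
Stratified sampling, as outlined above, might look like the following sketch. The function name, the fixed seed, the rule "at least one record per group" and the 90/10 customer data are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, fraction, seed=0):
    """Sample `fraction` of each group without replacement (at least one per group)."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)   # grouping the data in the data set
    sample = []
    for members in groups.values():
        size = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, size))   # sampling from each group
    return sample


# Hypothetical data: 90 "safe" and 10 "risky" customers.
customers = [("safe", i) for i in range(90)] + [("risky", i) for i in range(10)]
sample = stratified_sample(customers, key=lambda r: r[0], fraction=0.2)
print(len(sample), sum(1 for label, _ in sample if label == "risky"))
```

Because each group is sampled separately, the rare "risky" group keeps its 10% share of the sample, which plain random sampling does not guarantee.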


Data Pre-processing:
Data Dimension Reduction
 The Problem:
◼ Data set may have many dimensions (Ex. a 120x80 pixel image)
◼ Curse of dimensionality
 As dimensionality increases, the data become increasingly sparse
 The data analysis becomes harder because the mined patterns are less significant
and peculiar.
 The time for data processing increases significantly as dimensionality increases
 Not all attributes are equally significant in describing the pattern
 Why: to reduce the effects of curse of dimensionality
 How:
◼ Linear algebra techniques
 Principal component analysis (PCA) :: A B C -> D E F
 Independent component analysis (ICA) :: A B C -> D E
 Singular value decomposition (SVD) :: matrices
◼ Feature subset selection


Data Pre-processing:
Feature Subset Selection
 What: to reduce dimensionality by selecting a subset of attributes
 Purposes:
◼ To remove or reduce redundant features (ex. derived attributes)
◼ To remove irrelevant features that contain no useful information for the data mining task (ex. ID attribute)
 How:
◼ Use common sense and domain knowledge for manual selection
◼ Embedded approach: let the mining algorithm select suitable features (ex. decision tree induction)
◼ Filter and wrapper approaches:
[Figure: flowchart. Attributes → subset selection → subset evaluation → stopping criterion; "not ok" returns to subset selection, "ok" yields the selected subset, which is then validated with the mining task]


Data Pre-processing: Feature Creation
 What: to create a new set of features from the original features
 Purpose: in the new feature space, meaningful patterns can be
extracted more easily. The number of features may be reduced.
 How:
◼ Using feature extraction methods to extract new features from the
existing ones
 Ex. extracting color, texture and shape from image of pixel values
◼ Mapping data to a new space
 Ex. wavelet/Fourier transformation of pixel values of images to a frequency domain
◼ Constructing new features from the existing ones using domain
knowledge
 Ex. using transaction dates to construct a new feature customer tenure that
indicates the loyalty of the customer


Data Pre-processing:
Discretization / Binarization
 What: to convert continuous attribute values to discrete
categorical values, and to convert discrete categorical values to
binary Boolean attribute values
 Why:
◼ Requirement for some data mining solutions
◼ Better data mining results
 How:
◼ Binarization: convert m categorical values to integers in [0, m-1], and then either
 Convert each integer to a binary number of n bits where n = ⌈log2 m⌉, or
 Use m asymmetric binary variables to represent each of the m values
◼ Discretization process:
 Deciding how many categories to have and where split points should
be
 Mapping values to categories: sort the values, split them into sub-
ranges and map all the values in a sub-range to the same category
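
The two binarization options above can be sketched as follows. The function names and the five category labels are hypothetical; categories are numbered in sorted order here, which is an arbitrary choice (for an ordinal attribute one would number them in their natural order instead).

```python
import math


def binary_code(values):
    """Map m categories to integers 0..m-1, then to n = ceil(log2 m) bits."""
    categories = sorted(set(values))
    n_bits = max(1, math.ceil(math.log2(len(categories))))
    index = {c: i for i, c in enumerate(categories)}
    return {
        c: [(index[c] >> b) & 1 for b in reversed(range(n_bits))]
        for c in categories
    }


def one_hot(values):
    """Use m asymmetric binary variables; exactly one is 1 for each category."""
    categories = sorted(set(values))
    return {c: [int(c == other) for other in categories] for c in categories}


grades = ["poor", "OK", "good", "great", "awful"]   # hypothetical, m = 5
codes = binary_code(grades)
flags = one_hot(grades)
for grade in grades:
    print(grade, codes[grade], flags[grade])
```

With m = 5 the compact code needs 3 bits while the asymmetric representation needs 5 variables; the latter is the form usually required by association rule mining, where only the presence (1) of a value matters.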


Data Pre-processing: Discretization
 How:
◼ Unsupervised methods: discretization without regard to the outcome of any specific attribute, normally used for clustering and association rule discovery

  Original Data Set:       {11, 11, 12, 17, 18, 18, 19, 19, 20, 22, 24, 24, 25, 26}
  Equal width method:      {C, C, C, T, T, T, T, T, T, Y, Y, Y, Y, A}
  Equal frequency method:  {C, C, C, C, T, T, T, T, T, Y, Y, Y, Y, Y}
  Clustering method:       {C, C, C, T, T, T, T, T, T, Y, Y, Y, Y, Y}
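
The equal width and equal frequency methods can be sketched as below. The bin width of 5 starting at the minimum value, and the rule that later bins receive the extra items when the data do not divide evenly, are assumptions chosen because they reproduce the labels listed on this slide; other conventions are equally valid.

```python
ages = [11, 11, 12, 17, 18, 18, 19, 19, 20, 22, 24, 24, 25, 26]


def equal_width(values, width, origin=None):
    """Bin index = number of full widths a value lies above the origin."""
    origin = min(values) if origin is None else origin
    return [int((v - origin) // width) for v in values]


def equal_frequency(values, k):
    """Split the sorted values into k bins of near-equal size.

    Assumption: when len(values) is not divisible by k, the later bins
    receive the extra items.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    base, extra = divmod(len(values), k)
    sizes = [base + (1 if b >= k - extra else 0) for b in range(k)]
    bins = [0] * len(values)
    start = 0
    for b, size in enumerate(sizes):
        for i in order[start:start + size]:
            bins[i] = b
        start += size
    return bins


labels = "CTYA"
width_labels = [labels[b] for b in equal_width(ages, width=5)]
freq_labels = [labels[b] for b in equal_frequency(ages, k=3)]
print(width_labels)
print(freq_labels)
```

Note that a purely rank-based equal frequency split can place equal values in different bins; in practice split points are usually nudged so that ties stay together.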
Data Pre-processing: Discretization
 How:
◼ Supervised methods: discretization of attribute values with respect to the outcome of the class attribute, normally used for classification
◼ Simple methods: sort according to the class attribute, and then discretize the attribute values for each class. The problem is that the attribute values may not be nicely distributed according to the class values.
◼ Sophisticated methods: choose the discretization of the attribute values so that it purifies the outcome of the class. For instance, we can use entropy to measure the degree of purity, and decide the split points recursively, similar to decision tree induction (See Decision Tree Induction for detail).
◼ Merging methods: merge small intervals into a larger one until a stop criterion is met (See association rule part for detail).
◼ Supervised discretization of two attributes
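
One step of the entropy-based method can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the midpoint rule for placing the split, and the age/risk data are hypothetical. Applying `best_split` again to each side gives the recursive procedure the slide refers to.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits of a list of class labels; 0 means pure."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def best_split(values, labels):
    """Return (split_point, weighted_entropy) minimising class entropy of the two sides."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue   # cannot split between two equal attribute values
        left = [c for _, c in pairs[:i]]
        right = [c for _, c in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if score < best[1]:
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, score)
    return best


# Hypothetical data: an age attribute with a class label per record.
ages = [11, 12, 17, 18, 19, 22, 24, 26]
risk = ["low", "low", "low", "low", "high", "high", "high", "high"]
split, score = best_split(ages, risk)
print(split, score)
```

Here the split falls between 18 and 19 because both resulting intervals then contain a single class.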


Data Pre-processing:
Variable Transformation
 What: transform all values of an attribute to other values
 Why:
◼ Making the attribute values more sensible for mining
◼ Removing the effect of the outlier values
◼ Making the result data visualization more interpretable
◼ Making the values more comparable
 How:
◼ Transformation using function
 Ex. x^k, log(x), sin(x), etc.
◼ Standardization and/or normalization
 Ex. z-score, division-by-range, etc.
 Caution: transformation has to be done with care.
[Figure: histogram of Count# against Call time (sec), with intervals from [1, 10] up to [401, 500]]
[Figure: histogram of Count# against logarithm (base 10) of Call Time, with intervals [0, 1], [1, 2], [2, 3]]
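
The transformations named on this slide can be sketched as below. The call times are hypothetical, and `min_max` implements min-max scaling, a common variant of division by range that also subtracts the minimum; that choice is an assumption, not something the slide specifies.

```python
import math
import statistics


def z_score(values):
    """Standardize to mean 0 and (sample) standard deviation 1."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [(v - mu) / sigma for v in values]


def min_max(values):
    """Rescale to [0, 1] by subtracting the minimum and dividing by the range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical, heavily skewed call times in seconds.
call_times = [3, 8, 15, 42, 90, 450]

log_times = [math.log10(t) for t in call_times]   # requires strictly positive values
standardized = z_score(call_times)
rescaled = min_max(call_times)
print([round(v, 2) for v in log_times])
print([round(v, 2) for v in rescaled])
```

The logarithm compresses the long right tail, while min-max scaling leaves it in place: most rescaled values are squeezed near 0 by the single large value, which illustrates the caution that a transformation has to be chosen with care.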


Data Pre-processing:
Handling Missing Values
 What: to treat attributes with null values
 Why:
◼ Improve data quality
◼ Better mining results
 How:
◼ Eliminating data objects or attributes with missing values (may not always be
possible)
◼ Ignore the missing values
◼ Using sensible default values
 Ex. NumberOfTransactions is set to 0, if it is unknown
◼ Data imputation methods
 Average, median, or mode
 Average, median or mode of the nearest neighbors
◼ Postponing the handling to the data mining methods, making the mining methods
adaptive to missing values (see clustering and classification parts)
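
Simple imputation by average, median or mode can be sketched as follows. The function name `impute`, the use of `None` to mark a null value, and the sample attribute are assumptions for illustration.

```python
import statistics


def impute(values, strategy="mean"):
    """Replace None entries with the mean, median or mode of the observed values."""
    observed = [v for v in values if v is not None]
    summarize = {
        "mean": statistics.mean,
        "median": statistics.median,
        "mode": statistics.mode,
    }[strategy]
    fill = summarize(observed)
    return [fill if v is None else v for v in values]


ages = [11, None, 12, 17, None, 18, 18]   # hypothetical attribute with two nulls
by_mean = impute(ages, "mean")
by_median = impute(ages, "median")
by_mode = impute(ages, "mode")
print(by_mean, by_median, by_mode, sep="\n")
```

The nearest-neighbour variant listed above would compute the same summaries over only the most similar records rather than over the whole attribute.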


Data Exploration
 Knowing your data is essential for the success of data mining
 “Turn your machine loose and see what it can find” is a myth about data
mining
 Purposes:
◼ Better understanding of the characteristics of data
◼ Better decision over data pre-processing tasks
◼ Even being able to discover some hidden patterns
 Categories of data exploration techniques
◼ Summary statistics: using a small set of descriptors to describe the characteristics
of a large data set
◼ Data visualization: using graphical or tabular forms to reveal data patterns
◼ Online Analytic Processing (OLAP)
 There are some overlaps between Data Exploration and Exploratory Data
Analysis (EDA)


Data Summarization
 Frequency and Mode (often for categorical attributes):
◼ Frequency of value v: the number of data objects with v divided by the
total number of data objects
◼ Mode: the most frequently occurring value
◼ Ex. {11, 11, 12, 17, 18, 18, 19, 19, 20, 22, 24, 24, 25, 26}
Frequency of 11 is 2/14
Mode of the data set is 11, 18, 19 or 24
 Percentiles (for ordinal or continuous attributes):
◼ Given an attribute x and an integer p (0≤p≤100), the percentile xp is a value of x such that p% of the observed values of x are less than xp.
◼ Ex. Age {11, 11, 12, 17, 18, 18, 19, 19, 20, 22, 24, 24, 25, 26}
Percentile Age50 = 19
Percentile Age70 = 22
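
The summary statistics on this slide can be reproduced with a short sketch. Percentiles have several competing conventions; the nearest-rank rule used here is an assumption, chosen because it yields the slide's values Age50 = 19 and Age70 = 22. The function names are illustrative.

```python
import math
import statistics
from collections import Counter

ages = [11, 11, 12, 17, 18, 18, 19, 19, 20, 22, 24, 24, 25, 26]


def frequency(values, v):
    """Number of data objects with value v divided by the total number of objects."""
    return Counter(values)[v] / len(values)


def percentile(values, p):
    """Nearest-rank percentile: the value at rank ceil(p/100 * n) in sorted order."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


freq_11 = frequency(ages, 11)
modes = statistics.multimode(ages)   # every value tied for the highest count
age_50 = percentile(ages, 50)
age_70 = percentile(ages, 70)
print(freq_11, modes, age_50, age_70)
```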


Data Summarization
 Mean and Median (for continuous attributes):
◼ Mean: total sum of the values of the attribute divided by the number of values
◼ Median: sort the values of attribute, and the middle value or the
average of the middle two values is the median.
◼ Median is a better indication of “average” when data distribution
is skewed or outliers are present
◼ Ex. Age {11, 11, 12, 17, 18, 18, 19, 19, 20, 22, 24, 24, 25, 26}
Mean of Age = ∑Age/Count(Age) = 266/14 = 19
Median of Age = Avg(19,19) = 19
 Trimmed Mean and Median: computed after discarding p% of the values, p/2% from the top and p/2% from the bottom
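
The effect of an outlier on the mean, the median and the trimmed mean can be seen in a small sketch. The function name `trimmed_mean`, the rounding-down rule for the number of trimmed values, and the replacement of the last age by 260 are assumptions made for illustration.

```python
import statistics


def trimmed_mean(values, p):
    """Mean after discarding p% of the values: p/2% from each end."""
    ordered = sorted(values)
    cut = int(len(ordered) * p / 200)   # values dropped per end, rounded down
    kept = ordered[cut:len(ordered) - cut]
    return statistics.mean(kept)


# The Age data from the slide with the last value replaced by a hypothetical outlier.
ages = [11, 11, 12, 17, 18, 18, 19, 19, 20, 22, 24, 24, 25, 260]

plain_mean = statistics.mean(ages)
median = statistics.median(ages)
trimmed = trimmed_mean(ages, p=20)
print(round(plain_mean, 2), median, round(trimmed, 2))
```

The plain mean is pulled far above every typical value, while the median and the trimmed mean stay close to 19, in line with the remark that the median is the better indication of "average" when outliers are present.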


Data Summarization
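 Measures of Spread
◼ Range: $\mathrm{range}(x) = \max(x) - \min(x)$
◼ Variance ($\sigma^2$): $\sigma_x^2 = \frac{1}{m-1} \sum_{i=1}^{m} (x_i - \bar{x})^2$
◼ Standard Deviation ($\sigma$): $\sigma_x = \sqrt{\frac{1}{m-1} \sum_{i=1}^{m} (x_i - \bar{x})^2}$
◼ Absolute average deviation (AAD): $\mathrm{AAD}(x) = \frac{1}{m} \sum_{i=1}^{m} |x_i - \bar{x}|$
 Multivariate Summary Statistics
◼ Mean vector: $\bar{\mathbf{x}} = (\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_n)$
◼ Matrix of covariance: $\mathrm{covariance}(x, y) = \frac{1}{m-1} \sum_{i=1}^{m} (x_i - \bar{x})(y_i - \bar{y})$
◼ Correlation: $\mathrm{correlation}(x, y) = \frac{\mathrm{covariance}(x, y)}{\sigma_x \sigma_y}$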
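
The measures of spread and the bivariate statistics on this slide translate directly into code. The sketch below uses the m − 1 divisor for variance and covariance and the m divisor for AAD, as on the slide; the function names and the two small sample attributes are hypothetical.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def variance(xs):
    """Sample variance with the m - 1 divisor."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def aad(xs):
    """Absolute average deviation with the m divisor."""
    m = mean(xs)
    return sum(abs(x - m) for x in xs) / len(xs)


def covariance(xs, ys):
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    return covariance(xs, ys) / math.sqrt(variance(xs) * variance(ys))


x = [1.0, 2.0, 3.0, 4.0, 5.0]    # hypothetical attribute
y = [2.0, 4.0, 6.0, 8.0, 10.0]   # hypothetical attribute, exactly 2 * x
print(max(x) - min(x), variance(x), math.sqrt(variance(x)), aad(x))
print(covariance(x, y), correlation(x, y))
```

Because y is an exact linear function of x, the correlation comes out at 1, its maximum value.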


Data Visualization: Overview
 Rationale: human eyes are good at spotting patterns, particularly visual patterns.
 Major ways of visualizing data
◼ Tabular form
◼ Graphical form
◼ Points and links representing objects and relationships
 Visual representation of attributes is related to the data type of
the attributes
 Visualizing data as well as all its implicit relationships can be a
challenge.
 The visualization must have obvious affordance
 The visualization of data must tell the truth


Data Visualization: Methods
 Power of Arrangement
◼ Arrange rows and columns of a table
◼ Arrange graph components
◼ Sorting

  Table before arrangement:
     1 2 3 4 5 6
  1  0 1 0 1 1 0
  2  1 0 1 0 0 1
  3  0 1 0 1 1 0
  4  1 0 1 0 0 1
  5  0 1 0 1 1 0
  6  1 0 1 0 0 1
  7  0 1 0 1 1 0
  8  1 0 1 0 0 1
  9  0 1 0 1 1 0

  Same table with rows and columns rearranged:
     6 1 3 2 5 4
  4  1 1 1 0 0 0
  2  1 1 1 0 0 0
  6  1 1 1 0 0 0
  8  1 1 1 0 0 0
  5  0 0 0 1 1 1
  3  0 0 0 1 1 1
  9  0 0 0 1 1 1
  1  0 0 0 1 1 1
  7  0 0 0 1 1 1


Data Visualization: Methods
 Pie Chart for Categorical
Attribute
 Histogram/Bar chart
 Stem and Leaf plots for a
single attribute
 Scatter plots, Contour plots
and surface plots
 Data matrix, Parallel dim., star dim.
[Figure: scatter plot of Body Weight against Age, with points grouped as Child, Teen and Adult]


Pattern Visualization:
Types of Patterns
  outlook   temperature  humidity  windy  Class
  sunny     85           high      FALSE  negative
  sunny     80           high      TRUE   negative
  overcast  83           high      FALSE  positive
  rain      70           high      FALSE  positive
  rain      68           normal    FALSE  positive
  rain      64           normal    TRUE   negative
  overcast  64           normal    TRUE   positive
  sunny     72           high      FALSE  negative
  sunny     69           normal    FALSE  positive
  rain      75           normal    FALSE  positive
  sunny     75           normal    TRUE   positive
  overcast  72           high      TRUE   positive
  overcast  81           normal    FALSE  positive
  rain      71           high      TRUE   negative

  outlook   humidity  windy  Class
  overcast  any       any    positive
  sunny     high      any    negative
  sunny     normal    any    positive
  rain      any       TRUE   negative
  rain      any       FALSE  positive

  Name      Category  Subject    City        GPA
  Anderson  M.A.      History    London      3.5
  Bach      2nd       Math       Buck        3.6
  Fraser    M.Sc.     Physics    Brighton    3.4
  Patel     Ph.D.     Computing  Bombay      3.5
  Hart      3rd       Math       London      2.7
  Jackson   3rd       Computing  Colchester  2.5
  Liu       M.Sc.     Computing  Beijing     3.5
  Meyer     Ph.D.     Biology    Berlin      3.2
  Longaney  M.A.      History    Delhi       3.5
  Xia       Ph.D.     Biology    Shanghai    3.5

  Subject  City     GPA
  Art      Any      Excellent
  Science  Any      Good
  Science  Foreign  Excellent


Pattern Visualization:
Types of Patterns
[Figures: Artificial Neural Networks; Decision Trees]


Pattern Visualization:
Types of Patterns
 Rules of various kinds
◼ Classification rules
 Ex. IF A = a1 and B = b1 … THEN Class = c1
◼ Association rules
 Ex. IF jean and nappies, THEN beer
 Ex. IF food, THEN papers and magazines
◼ Rules with exceptions
 Ex. IF income = middle and hasCar THEN good customer except IF age = teen THEN risky customer
◼ Rules involving relations
 Ex. IF width > height THEN things
 Ex. IF height > width THEN human
[Figure: diagram with items labelled sugar, flour and milk]
Pattern Visualization:
Types of Patterns
 Clusters
[Figure: objects a, b, c, d, e, f, g, h, i and k grouped into clusters]

     Cluster 1  Cluster 2  Cluster 3
  a  0,4        0,1        0,5
  b  0,1        0,8        0,1
  c  0,3        0,3        0,4
  d  0,1        0,1        0,8
  e  0,4        0,2        0,4
  f  0,7        0,1        0,2
  g  0,5        0,4        0,1


Data Exploration, OLAP
 Online Analytic Processing (OLAP)
◼ Interactive reporting with visualization
◼ Handles non-trivial queries
◼ Fast operation and fast delivery of result
◼ Directly support business operations
 Typical OLAP query:
◼ For each product, find its market share in its category
today minus its market share in its category in 1994
Products Market Share Today Market Share in 1994 difference
Dell 17" 17% 10% 7%
HP 15" 83% 90% -7%
Intel MotherB 56% 93% -37%
………… …. …. …….


Data Exploration, OLAP
 Online Analytic Processing (OLAP)
◼ Treating a data set as a multidimensional
cube
◼ Operations over the cube for data
summarization and data drilling


Data Exploration, OLAP
  Branch Name    Customer Name  Month  Year
  Buckingham     Helen Miles    April  2000
  Buckingham     Mary Laughton  April  1999
  ……             ……             ….     ….
  Milton Keynes  Alen Young     Feb    2000
  Milton Keynes  Susan Young    April  2000
  ……             ……             ….     ….
  Northampton    Frank Sinatra  April  1998
  …………           ….             ….     ….

[Figure: data cube with dimensions Year (1998, 1999, 2000), Branch (Buckingham, Milton Keynes, Northampton) and Month (Jan, Feb, March, …, Dec). A highlighted cell (1998, Milton Keynes, March) holds "Total Customer = 5" and the customer names. A second cube rolls the months up to seasons (winter, spring, summer, autumn).]


Summary
 To conduct data mining effectively, one must know the data set.
 Data types have a significant effect on the meaningfulness of the patterns. Inappropriate operations applied to the wrong types of data make the patterns meaningless.
 Data quality is an important but relative issue. “Garbage-in, garbage-out”! However,
making mining solutions “garbage” tolerant is equally important.
 Various data pre-processing tasks may need to be conducted before mining starts. This
stage normally takes most of the time.
 Which data pre-processing needs to be performed depends on the intended application
and intended data mining tasks.
 Data exploration helps data understanding, and hence is the first task before serious
mining.
 Using data summary, visualisation and even OLAP operations, we can gain insight into
the nature of the data, which helps in forming valid and sensible mining operations.


1 page
Csis355 Classifications 1
No ratings yet
Csis355 Classifications 1
70 pages
2.predavanje-Kontrola Lijekova
No ratings yet
2.predavanje-Kontrola Lijekova
8 pages
Revision of Passive
No ratings yet
Revision of Passive
2 pages
Coca-Cola Color Splash PDF
No ratings yet
Coca-Cola Color Splash PDF
1 page
Jowett 0
No ratings yet
Jowett 0
7 pages
SETS
50% (2)
SETS
26 pages
Repair and Maintenance: Cooler
100% (1)
Repair and Maintenance: Cooler
61 pages
Harmonica Chords
100% (1)
Harmonica Chords
2 pages
Section 6 Quiz 1 l1 l4
No ratings yet
Section 6 Quiz 1 l1 l4
4 pages
LQ043T3DX02 SP 122805 PDF
No ratings yet
LQ043T3DX02 SP 122805 PDF
25 pages
Chapter 4 Conic Section and Its Application
100% (1)
Chapter 4 Conic Section and Its Application
13 pages
TM800V Service Manual
No ratings yet
TM800V Service Manual
149 pages
Novel Convolutional Neural Network (NCNN) For The Diagnosis of Bearing Defects in Rotary Machinery
No ratings yet
Novel Convolutional Neural Network (NCNN) For The Diagnosis of Bearing Defects in Rotary Machinery
10 pages
FN-642 842 1042-ULADA DataSheet
No ratings yet
FN-642 842 1042-ULADA DataSheet
1 page
Taking The Control System For Granted - Ensuring The Integrity of Sub-Sil Instrumented Functions
No ratings yet
Taking The Control System For Granted - Ensuring The Integrity of Sub-Sil Instrumented Functions
5 pages
Physics
100% (1)
Physics
7 pages
Projector+itachi CP S370W
No ratings yet
Projector+itachi CP S370W
60 pages
DX Diag
No ratings yet
DX Diag
33 pages
Program Gempur SPM Perlis P2 - 2018
No ratings yet
Program Gempur SPM Perlis P2 - 2018
16 pages
ENB301 Practice Mid-Sem Exam PDF
No ratings yet
ENB301 Practice Mid-Sem Exam PDF
2 pages
Computer Science Class Notes
No ratings yet
Computer Science Class Notes
3 pages
Le Châtelier's Principle: Experiment 5
No ratings yet
Le Châtelier's Principle: Experiment 5
5 pages
Statistical Tool Iggat Shaira Salinen Ruffa Grace
No ratings yet
Statistical Tool Iggat Shaira Salinen Ruffa Grace
14 pages
1, S4 New Curriculum Chemistry Chapter 3 - Trendsin The Periodic Table
No ratings yet
1, S4 New Curriculum Chemistry Chapter 3 - Trendsin The Periodic Table
9 pages
Microcontrollers
No ratings yet
Microcontrollers
13 pages
Cavity Vent Valve
No ratings yet
Cavity Vent Valve
2 pages
Free Body Diagrams With Animated GIF Files: Paper ID #16401
No ratings yet
Free Body Diagrams With Animated GIF Files: Paper ID #16401
12 pages
6 Mips Datapath
No ratings yet
6 Mips Datapath
55 pages
MAths IGCSE PAper 2 May 2002
60% (5)
MAths IGCSE PAper 2 May 2002
12 pages
BERGHOUT Et Al, 2020 - Aircraft Engines Remaining Useful Life Prediction With An Adaptive Denoising Online Sequential Extreme Learning Machine
No ratings yet
BERGHOUT Et Al, 2020 - Aircraft Engines Remaining Useful Life Prediction With An Adaptive Denoising Online Sequential Extreme Learning Machine
10 pages

Data Mining Techniques & Applications

Uploaded by

Data Mining Techniques & Applications

Uploaded by


 Data set of records
◼ (transaction ID and set
◼ Data matrix (table with




data objects or their attributes
by summarizing low level data
 Why: to reduce the time of mining, to rescale data values,
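The idea of summarizing low-level data can be illustrated with a minimal sketch. The `sales` records, their field names, and the `aggregate` helper are hypothetical and are not taken from the slides; the sketch only shows how several low-level records collapse into one summary value per group.

```python
from collections import defaultdict


def aggregate(records, key, value):
    """Sum `value` over all records that share the same `key`."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec[key]] += rec[value]
    return dict(totals)


# Hypothetical low-level data: one record per individual sale.
sales = [
    {"store": "A", "day": 1, "amount": 10.0},
    {"store": "A", "day": 2, "amount": 5.0},
    {"store": "B", "day": 1, "amount": 7.5},
]

# Three low-level records are summarized into two store-level totals.
store_totals = aggregate(sales, key="store", value="amount")
print(store_totals)
```

The aggregated data set has fewer objects than the original, which is where the reduction in mining time comes from.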


One subset evaluation


other values
◼ Making the attribute values more sensible for
◼ Removing the effect of the outlier values
Making the result data visualization more
◼ Making the values more comparable
◼ Standardization and/or normalization
 Ex. z-score, division-by-range, etc.
 Caution: transformation has to be done with
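The two transformations named above can be sketched as follows. This is an illustration under stated assumptions, not the slides' own definition: the z-score uses the population standard deviation, and "division-by-range" is read as dividing each value by (max - min). Min-max scaling, which also subtracts the minimum first, is a closely related variant. The function names and the sample values are hypothetical.

```python
from statistics import mean, pstdev


def z_score(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("z-score is undefined when all values are equal")
    return [(v - mu) / sigma for v in values]


def divide_by_range(values):
    """Divide every value by the range (max - min) of the attribute."""
    span = max(values) - min(values)
    if span == 0:
        raise ValueError("range is zero; cannot divide by range")
    return [v / span for v in values]


# Hypothetical attribute values, chosen so that mean = 5 and std = 2.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

standardized = z_score(data)
scaled = divide_by_range(data)
print(standardized)
print(scaled)
```

After either transformation the values no longer carry their original units, which is one reason a transformation has to be applied with care.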


Absolute average deviation (AAD)

 Multivariate Summary Statistics x = (x1, x2, ..., xn)
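As a hedged illustration of the absolute average deviation (AAD) listed above, the sketch below uses the usual definition: the mean of the absolute differences between each value and the attribute's mean. The slide's own formula did not survive extraction, so the function name and the sample data are assumptions.

```python
def absolute_average_deviation(values):
    """Mean absolute difference between each value and the mean."""
    if not values:
        raise ValueError("AAD is undefined for an empty sequence")
    centre = sum(values) / len(values)
    return sum(abs(v - centre) for v in values) / len(values)


# Hypothetical attribute values with mean 5.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
aad = absolute_average_deviation(data)
print(aad)
```

Because deviations are not squared, a single extreme value affects the AAD less than it affects the standard deviation.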


◼ Arrange rows and columns

◼ Arrange graph components


Name Category Subject City GPA

