Introduction to Data Mining, 2nd Edition: by Tan, Steinbach, Karpatne, Kumar

The document introduces the topic of data mining. It states that there has been enormous growth in data collection due to advances in technology. It also notes that data mining can help uncover valuable patterns from data for both commercial and scientific purposes. Finally, it provides an overview of some common data mining tasks like prediction, description, and clustering.

Data Mining: Introduction

Introduction to Data Mining, 2nd Edition


by
Tan, Steinbach, Karpatne, Kumar

01/17/2018 Introduction to Data Mining, 2nd Edition 1


Large-scale Data is Everywhere!

 There has been enormous data growth in both commercial and
scientific databases due to advances in data generation and
collection technologies
 New mantra
– Gather whatever data you can whenever and wherever possible.
 Expectations
– Gathered data will have value either for the purpose collected
or for a purpose not envisioned.

[Slide images: E-Commerce, Cyber Security, Traffic Patterns, Social
Networking (Twitter), Sensor Networks, Computational Simulations]
Why Data Mining? Commercial Viewpoint

 Lots of data is being collected and warehoused
– Web data
 Yahoo has petabytes of web data
 Facebook has billions of active users
– Purchases at department/grocery stores, e-commerce
 Amazon handles millions of visits/day
– Bank/Credit Card transactions
 Computers have become cheaper and more powerful
 Competitive Pressure is Strong
– Provide better, customized services for an edge (e.g., in
Customer Relationship Management)


Why Data Mining? Scientific Viewpoint

 Data collected and stored at enormous speeds
– Remote sensors on a satellite
 NASA EOSDIS archives over petabytes of earth science data / year
– Telescopes scanning the skies
 Sky survey data
– High-throughput biological data
– Scientific simulations
 Terabytes of data generated in a few hours
 Data mining helps scientists
– In automated analysis of massive datasets
– In hypothesis formation

[Slide images: fMRI Data from Brain, Sky Survey Data, Gene
Expression Data, Surface Temperature of Earth]
Great opportunities to improve productivity in all walks of life



Great Opportunities to Solve Society’s Major Problems

 Improving health care and reducing costs
 Predicting the impact of climate change
 Reducing hunger and poverty by increasing agriculture production
 Finding alternative/green energy sources
What is Data Mining?
 Many Definitions
– Non-trivial extraction of implicit, previously unknown
and potentially useful information from data
– Exploration & analysis, by automatic or semi-automatic
means, of large quantities of data in order to discover
meaningful patterns



What is (not) Data Mining?

 What is not Data Mining?
– Look up phone number in phone directory
– Query a Web search engine for information about “Amazon”

 What is Data Mining?
– Certain names are more prevalent in certain US locations
(O’Brien, O’Rourke, O’Reilly… in Boston area)
– Group together similar documents returned by search engine
according to their context (e.g., Amazon rainforest, Amazon.com)
Origins of Data Mining

 Draws ideas from machine learning/AI, pattern recognition,
statistics, and database systems

 Traditional techniques may be unsuitable due to data that is
– Large-scale
– High dimensional
– Heterogeneous
– Complex
– Distributed

 A key component of the emerging field of data science and
data-driven discovery
Data Mining Tasks

 Prediction Methods
– Use some variables to predict unknown or
future values of other variables.

 Description Methods
– Find human-interpretable patterns that
describe the data.

From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996



Data Mining Tasks …

[Slide figure: a sample data table surrounded by the four task
labels: Clustering, Predictive Modeling, Anomaly Detection, and
Association Rules]

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes


Predictive Modeling: Classification

 Find a model for class attribute as a function of the values of
other attributes

Training data:

Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Graduate            5                           Yes
2    Yes       High School         2                           No
3    No        Undergrad           1                           No
4    Yes       High School         10                          Yes
…    …         …                   …                           …

Model for predicting credit worthiness (decision tree):
– Employed = No → No
– Employed = Yes, Education = Graduate → Number of years:
> 3 yr → Yes, < 3 yr → No
– Employed = Yes, Education ∈ {High school, Undergrad} → Number
of years: > 7 yrs → Yes, < 7 yrs → No


Classification Example

Training Set (attribute types: categorical, categorical,
quantitative; class label):

Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Graduate            5                           Yes
2    Yes       High School         2                           No
3    No        Undergrad           1                           No
4    Yes       High School         10                          Yes
…    …         …                   …                           …

Test Set:

Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Undergrad           7                           ?
2    No        Graduate            3                           ?
3    Yes       High School         2                           ?
…    …         …                   …                           …

Training Set → Learn Classifier → Model → applied to Test Set
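The decision tree for credit worthiness can be written out directly as nested conditionals. A minimal sketch, with the tree structure read off the slide's figure (illustrative only, not the textbook's code):

```python
# Hand-coded version of the credit-worthiness decision tree from the slide.

def credit_worthy(employed, education, years_at_address):
    """Classify an applicant as credit worthy ("Yes") or not ("No")."""
    if employed == "No":
        return "No"                                   # Employed = No branch
    if education == "Graduate":
        return "Yes" if years_at_address > 3 else "No"
    return "Yes" if years_at_address > 7 else "No"    # High School / Undergrad

# Apply the learned model to the test records from the slide
test_set = [("Yes", "Undergrad", 7), ("No", "Graduate", 3), ("Yes", "High School", 2)]
predictions = [credit_worthy(*rec) for rec in test_set]
print(predictions)  # ['No', 'No', 'No']
```

The model reproduces all four training labels, which is exactly what a decision-tree learner would aim for on this data.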


Examples of Classification Task

 Classifying credit card transactions as legitimate or fraudulent
 Classifying land covers (water bodies, urban areas, forests,
etc.) using satellite data
 Categorizing news stories as finance, weather, entertainment,
sports, etc.
 Identifying intruders in cyberspace
 Predicting tumor cells as benign or malignant
 Classifying secondary structures of protein as alpha-helix,
beta-sheet, or random coil


Classification: Application 1

 Fraud Detection
– Goal: Predict fraudulent cases in credit card transactions.
– Approach:
 Use credit card transactions and the information on the
account holder as attributes.
– When does a customer buy, what does he buy, how often does
he pay on time, etc.
 Label past transactions as fraud or fair transactions. This
forms the class attribute.
 Learn a model for the class of the transactions.
 Use this model to detect fraud by observing credit card
transactions on an account.


Classification: Application 2

 Churn prediction for telephone customers
– Goal: To predict whether a customer is likely to be lost to a
competitor.
– Approach:
 Use detailed records of transactions with each of the past
and present customers to find attributes.
– How often the customer calls, where he calls, what time of
day he calls most, his financial status, marital status, etc.
 Label the customers as loyal or disloyal.
 Find a model for loyalty.

From [Berry & Linoff] Data Mining Techniques, 1997


Classification: Application 3
 Sky Survey Cataloging
– Goal: To predict class (star or galaxy) of sky objects,
especially visually faint ones, based on the telescopic
survey images (from Palomar Observatory).
– 3000 images with 23,040 x 23,040 pixels per image.
– Approach:
 Segment the image.
 Measure image attributes (features) - 40 of them per
object.
 Model the class based on these features.
 Success Story: Could find 16 new high red-shift quasars, some
of the farthest objects that are difficult to find!

From [Fayyad, et al.] Advances in Knowledge Discovery and Data Mining, 1996


Classifying Galaxies
Courtesy: http://aps.umn.edu

Class:
• Stages of Formation (Early, Intermediate, Late)

Attributes:
• Image features
• Characteristics of light waves received, etc.

Data Size:
• 72 million stars, 20 million galaxies
• Object Catalog: 9 GB
• Image Database: 150 GB


Regression

 Predict a value of a given continuous-valued variable based on
the values of other variables, assuming a linear or nonlinear
model of dependency.
 Extensively studied in statistics and neural network fields.
 Examples:
– Predicting sales amounts of a new product based on
advertising expenditure.
– Predicting wind velocities as a function of temperature,
humidity, air pressure, etc.
– Time series prediction of stock market indices.
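The first example (sales as a function of advertising expenditure) can be sketched with closed-form ordinary least squares for one predictor; the data points below are made up for illustration:

```python
# Ordinary least squares for one predictor: sales ~ a + b * advertising.

def fit_line(xs, ys):
    """Return intercept a and slope b minimizing squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

advertising = [10, 20, 30, 40]   # spend (hypothetical units)
sales       = [25, 45, 65, 85]   # observed sales (hypothetical)
a, b = fit_line(advertising, sales)
predicted = a + b * 50           # predict sales at a new spend level
print(a, b, predicted)           # the toy data lie exactly on sales = 5 + 2x
```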


Clustering

 Finding groups of objects such that the objects in a group will
be similar (or related) to one another and different from (or
unrelated to) the objects in other groups
– Intra-cluster distances are minimized
– Inter-cluster distances are maximized
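One widely used algorithm for this objective is K-means, which alternates between assigning points to the nearest centroid and recomputing centroids as cluster means. A minimal 1-D sketch on toy data (not from the slides):

```python
# Minimal K-means on 1-D points, k = 2.
import random

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)              # initial centroids
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                           # assignment step
            i = min(range(k), key=lambda c: abs(p - centroids[c]))
            clusters[i].append(p)
        for i, members in enumerate(clusters):     # update step
            if members:
                centroids[i] = sum(members) / len(members)
    return sorted(centroids)

points = [1.0, 1.2, 0.8, 10.0, 10.5, 9.5]
print(kmeans(points, 2))   # two centroids, one near each tight group
```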


Applications of Cluster Analysis

 Understanding
– Custom profiling for targeted marketing
– Group related documents for browsing
– Group genes and proteins that have similar functionality
– Group stocks with similar price fluctuations
 Summarization
– Reduce the size of large data sets

Courtesy: Michael Eisen

[Figure: Clusters for Raw SST and Raw NPP. Use of K-means to
partition Sea Surface Temperature (SST) and Net Primary Production
(NPP) into clusters that reflect the Northern and Southern
Hemispheres. Clusters shown over longitude/latitude: Land Cluster 1,
Land Cluster 2, Ice or No NPP, Sea Cluster 1, Sea Cluster 2.]
Clustering: Application 1

 Market Segmentation:
– Goal: subdivide a market into distinct subsets of customers
where any subset may conceivably be selected as a
market target to be reached with a distinct marketing mix.
– Approach:
 Collect different attributes of customers based on their
geographical and lifestyle related information.
 Find clusters of similar customers.
 Measure the clustering quality by observing buying
patterns of customers in same cluster vs. those from
different clusters.



Clustering: Application 2

 Document Clustering:
– Goal: To find groups of documents that are similar to
each other based on the important terms appearing in
them.

– Approach: To identify frequently occurring terms in


each document. Form a similarity measure based on
the frequencies of different terms. Use it to cluster.

Enron email dataset

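The approach above can be sketched with term-frequency vectors and cosine similarity, a common choice of similarity measure for documents. The toy documents below are hypothetical (not the Enron data):

```python
# Cosine similarity between term-frequency vectors of two documents.
from collections import Counter
import math

def cosine_sim(doc_a, doc_b):
    """Similarity in [0, 1] based on shared term frequencies."""
    ta, tb = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(ta[w] * tb[w] for w in set(ta) & set(tb))
    norm = math.sqrt(sum(v * v for v in ta.values())) * \
           math.sqrt(sum(v * v for v in tb.values()))
    return dot / norm if norm else 0.0

d1 = "amazon rainforest trees river rainforest"
d2 = "amazon river rainforest wildlife"
d3 = "amazon online shopping orders"
print(cosine_sim(d1, d2), cosine_sim(d1, d3))  # d1 is closer to d2 than to d3
```

A clustering algorithm would then group d1 with d2 (Amazon rainforest context) and keep d3 (Amazon.com context) apart, as in the earlier "What is Data Mining?" example.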


Association Rule Discovery: Definition

 Given a set of records each of which contains some number of
items from a given collection
– Produce dependency rules which will predict occurrence of an
item based on occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
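The two discovered rules can be checked by support and confidence counting over the five transactions above; a minimal sketch:

```python
# Support and confidence over the TID table from the slide.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Fraction of lhs-containing transactions that also contain rhs."""
    return support(set(lhs) | set(rhs)) / support(lhs)

print(confidence({"Milk"}, {"Coke"}))            # 3 of the 4 Milk baskets
print(confidence({"Diaper", "Milk"}, {"Beer"}))  # 2 of the 3 Diaper+Milk baskets
```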


Association Analysis: Applications

 Market-basket analysis
– Rules are used for sales promotion, shelf management,
and inventory management

 Telecommunication alarm diagnosis


– Rules are used to find combination of alarms that occur
together frequently in the same time period

 Medical Informatics
– Rules are used to find combination of patient symptoms
and test results associated with certain diseases



Association Analysis: Applications

 An Example Subspace Differential Coexpression Pattern from a
lung cancer dataset
– Three lung cancer datasets [Bhattacharjee et al. 2001],
[Stearman et al. 2005], [Su et al. 2007]
– Enriched with the TNF/NF-κB signaling pathway, which is
well-known to be related to lung cancer
– P-value: 1.4*10^-5 (6/10 overlap with the pathway)

[Fang et al. PSB 2010]
Deviation/Anomaly/Change Detection
 Detect significant deviations from
normal behavior
 Applications:
– Credit Card Fraud Detection
– Network Intrusion
Detection
– Identify anomalous behavior from
sensor networks for monitoring and
surveillance.
– Detecting changes in the global forest
cover.

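A minimal sketch of deviation detection on toy sensor readings, using a simple z-score rule (a common baseline chosen here for illustration; not from the slides):

```python
# Flag readings that deviate from the mean by more than a multiple of
# the standard deviation. Toy sensor data, hypothetical values.
import math

def zscore_anomalies(values, threshold=3.0):
    """Return values more than `threshold` standard deviations from the mean."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [v for v in values if abs(v - mean) > threshold * std]

readings = [20.1, 19.8, 20.3, 20.0, 19.9, 35.0, 20.2]
print(zscore_anomalies(readings, threshold=2.0))  # [35.0]
```

Real deployments (fraud, intrusion detection) use far richer models, but the core idea is the same: score each observation by its deviation from expected behavior.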


Motivating Challenges

 Scalability

 High Dimensionality

 Heterogeneous and Complex Data

 Data Ownership and Distribution

 Non-traditional Analysis



What Is a Data Warehouse?

 Data warehousing provides architectures and tools for business
executives to systematically organize, understand, and use their
data to make strategic decisions
 Data warehouse refers to a data repository that is maintained
separately from an organization’s operational databases
 According to William H. Inmon, a leading architect in the
construction of data warehouse systems, “A data warehouse is a
subject-oriented, integrated, time-variant, and nonvolatile
collection of data in support of management’s decision making
process”


Features of a Data Warehouse

 Subject-oriented: organized around major subjects such as
customer, supplier, product, and sales, rather than the
day-to-day operations and transaction processing of an
organization
 Integrated: constructed by integrating multiple heterogeneous
sources, such as relational databases, flat files, and online
transaction records
 Time-variant: data are stored to provide information from an
historic perspective
 Nonvolatile: a data warehouse is always a physically separate
store of data transformed from the application data found in the
operational environment
 The major task of online operational database systems is to perform
online transaction and query processing. These systems are called
online transaction processing (OLTP) systems
 purchasing, inventory, manufacturing, banking, payroll, registration,
and accounting
 Data warehouse systems, on the other hand, serve users or
knowledge workers in the role of data analysis and decision making.
Such systems can organize and present data in various formats in
order to accommodate the diverse needs of different users. These
systems are known as online analytical processing (OLAP)
systems



OLTP and OLAP systems differ along these dimensions:

 Users and system orientation
 Data contents
 Database design
 View
 Access patterns


 Users and system orientation: An OLTP system is
customer-oriented and is used for transaction and query
processing by clerks, clients, and information technology
professionals
 An OLAP system is market-oriented and is used for data analysis
by knowledge workers, including managers, executives, and
analysts


 Data contents : An OLTP system manages
current data that, typically, are too detailed to be
easily used for decision making
 An OLAP system manages large amounts of
historic data, provides facilities for summarization
and aggregation, and stores and manages
information at different levels of granularity.



 Database design: An OLTP system usually
adopts an entity-relationship (ER) data model and
an application-oriented database design
 An OLAP system typically adopts either a star or
a snowflake model and a subject-oriented
database design



 View: An OLTP system focuses mainly on the current data within
an enterprise or department, without referring to historic data
or data in different organizations
 An OLAP system often spans multiple versions of a database
schema, due to the evolutionary process of an organization


 Access patterns: The access patterns of an OLTP system consist
mainly of short, atomic transactions. Such a system requires
concurrency control and recovery mechanisms
 Accesses to OLAP systems are mostly read-only operations


 But, why have a separate Data Warehouse?
 Why not perform online analytical processing directly on
operational databases instead of spending additional time and
resources to construct a separate data warehouse?
– To promote the high performance of both systems. An
operational database is designed and tuned for known tasks and
workloads, like indexing and hashing using primary keys,
searching for particular records, and optimizing “canned”
queries
 Data warehouse queries are often complex. They involve the
computation of large data groups at summarized levels, and may
require the use of special data organization, access, and
implementation methods based on multidimensional views.
Processing OLAP queries in operational databases would
substantially degrade the performance of operational tasks


A Multitiered Architecture



 The bottom tier is a warehouse database server that is almost
always a relational database system
 The data are extracted using application program interfaces
known as gateways. Gateways include ODBC (Open Database
Connectivity) and OLE DB by Microsoft, and JDBC (Java Database
Connectivity)


 The middle tier is an OLAP server that is typically implemented
using either (1) a relational OLAP (ROLAP) model (i.e., an
extended relational DBMS that maps operations on multidimensional
data to standard relational operations); or (2) a
multidimensional OLAP (MOLAP) model
 The top tier is a front-end client layer, which contains query
and reporting tools, analysis tools, and/or data mining tools
(e.g., trend analysis, prediction, and so on).
Data Warehouse Models

 From the architecture point of view, there are three data
warehouse models:
 The enterprise warehouse
 The data mart
 The virtual warehouse


enterprise warehouse
 An enterprise warehouse collects all of the
information about subjects spanning the entire
organization
 It contains detailed data as well as summarized
data, and can range in size from a few gigabytes
to hundreds of gigabytes, terabytes, or beyond
 An enterprise data warehouse may be
implemented on traditional mainframes, computer
super servers, or parallel architecture platforms.



Data mart

 A data mart contains a subset of corporate-wide data that is of
value to a specific group of users
 The scope is confined to specific selected subjects. For
example, a marketing data mart may confine its subjects to
customer, item, and sales
 Data marts are usually implemented on low-cost departmental
servers that are Unix/Linux or Windows based
 Depending on the source of data, data marts can be categorized
as independent or dependent
Virtual warehouse

 A virtual warehouse is a set of views over operational
databases. For efficient query processing, only some of the
possible summary views may be materialized
 A virtual warehouse is easy to build but requires excess
capacity on operational database servers
 Data warehouse systems use back-end tools and utilities to populate
and refresh their data
 These tools and utilities include the following functions:
 Data extraction, which typically gathers data from multiple,
heterogeneous, and external sources.
 Data cleaning, which detects errors in the data and rectifies them
when possible.
 Data transformation, which converts data from legacy or host format
to warehouse format.
 Load, which sorts, summarizes, consolidates, computes views,
checks integrity, and builds indices and partitions.
 Refresh, which propagates the updates from the data sources to the
warehouse



Data Cube

 A Multidimensional Data Model
 A data cube allows data to be modeled and viewed in multiple
dimensions. It is defined by dimensions and facts.
 Dimensions are the perspectives or entities with respect to
which an organization wants to keep records
 AllElectronics may create a sales data warehouse in order to
keep records of the store’s sales with respect to the dimensions
time, item, branch, and location.
 The dimension table for item may contain the attributes item
name, brand, and type


 Facts are numeric measures. Think of them as the
quantities by which we want to analyze relationships
between dimensions
 The fact table contains the names of the facts, or
measures, as well as keys to each of the related
dimension tables.



 The 0-D cuboid, which holds the highest level of
summarization, is called the apex cuboid. In our
example, this is the total sales, or dollars sold,
summarized over all four dimensions



 The most popular data model for a data warehouse is a
multidimensional model, which can exist in the form of
a star schema, a snowflake schema, or a fact
constellation schema
 Star schema: The most common modeling paradigm
is the star schema, in which the data warehouse
contains (1) a large central table (fact table) containing
the bulk of the data, with no redundancy, and (2) a set of
smaller attendant tables (dimension tables), one for
each dimension



 Snowflake schema: The snowflake schema is a variant of the star
schema model, where some dimension tables are normalized, thereby
further splitting the data into additional tables. The resulting
schema graph forms a shape similar to a snowflake
 The major difference between the snowflake and star schema
models is that the dimension tables of the snowflake model may be
kept in normalized form to reduce redundancies
 Example: a location dimension table with attributes (location
key, street, city, province or state, country) may be normalized
into separate location and city tables


 Fact constellation: Sophisticated applications
may require multiple fact tables to share
dimension tables. This kind of schema can be
viewed as a collection of stars, and hence is
called a galaxy schema or a fact constellation.



 A concept hierarchy defines a sequence of
mappings from a set of low-level concepts to
higher-level, more general concepts. Consider a
concept hierarchy for the dimension location.



 the attributes of a dimension may be organized in
a partial order, forming a lattice. An example of a
partial order for the time dimension based on the
attributes day, week, month, quarter, and year is
“day <{month < quarter; week} < year.”



 A concept hierarchy that is a total or partial order among
attributes in a database schema is called a schema hierarchy.
 Concept hierarchies may also be defined by discretizing or
grouping values for a given dimension or attribute, resulting in
a set-grouping hierarchy.
 A total or partial order can be defined among groups of values.
An example of a set-grouping hierarchy is shown in Figure 4.11
for the dimension price, where an interval ($X…$Y] denotes the
range from $X (exclusive) to $Y (inclusive)


 A data cube measure is a numeric function that can be evaluated
at each point in the data cube space
 Measures can be organized into three categories (distributive,
algebraic, and holistic) based on the kind of aggregate functions
used
 Distributive: An aggregate function is distributive if it can
be computed in a distributed manner as follows. Suppose the data
are partitioned into n sets. We apply the function to each
partition, resulting in n aggregate values. If applying the
function to those n aggregate values gives the same result as
applying it to the entire data set, the function is distributive
 count(), min(), and max() are distributive aggregate functions
 Algebraic: An aggregate function is algebraic if it can be
computed by an algebraic function with M arguments (where M is a
bounded positive integer), each of which is obtained by applying
a distributive aggregate function. For example, avg() can be
computed from sum() and count() (M = 2)
 Holistic: An aggregate function is holistic if there is no
constant bound on the storage size needed to describe a
subaggregate. That is, there does not exist an algebraic function
with M arguments (where M is a constant) that characterizes the
computation. Examples: median(), mode(), and rank(). A measure is
holistic if it is obtained by applying a holistic aggregate
function
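The distributive/algebraic distinction can be checked directly: sum() distributes over partitions, while avg() (not distributive, since averaging averages is wrong in general) is recoverable from the distributive pair (sum, count). A small sketch on made-up data:

```python
# Distributive: applying sum() per partition and then to the partial
# results gives the same answer as sum() over all of the data.
data = [4, 8, 15, 16, 23, 42]
partitions = [data[:2], data[2:4], data[4:]]

partial_sums = [sum(p) for p in partitions]
assert sum(partial_sums) == sum(data)            # sum() is distributive

# Algebraic: avg() is computable from M = 2 distributive values per
# partition, namely (sum, count).
partials = [(sum(p), len(p)) for p in partitions]
total, count = map(sum, zip(*partials))
avg = total / count
assert avg == sum(data) / len(data)
print(total, count, avg)
```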
Typical OLAP Operations

 The roll-up operation (also called the drill-up operation by
some vendors) performs aggregation on a data cube, either by
climbing up a concept hierarchy for a dimension or by dimension
reduction.
 When roll-up is performed by dimension reduction, one or more
dimensions are removed from the given cube
 Drill-down: Drill-down is the reverse of roll-up. It navigates
from less detailed data to more detailed data. Drill-down can be
realized by either stepping down a concept hierarchy for a
dimension or introducing additional dimensions
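Roll-up by climbing a concept hierarchy can be sketched as re-aggregating city-level sales to the country level; the hierarchy and figures below are hypothetical:

```python
# Roll-up on the location dimension: climb the city -> country
# hierarchy and re-aggregate the sales measure.
city_sales = {"Vancouver": 100, "Toronto": 150, "New York": 200, "Chicago": 120}
city_to_country = {"Vancouver": "Canada", "Toronto": "Canada",
                   "New York": "USA", "Chicago": "USA"}

country_sales = {}
for city, amount in city_sales.items():
    country = city_to_country[city]
    country_sales[country] = country_sales.get(country, 0) + amount

print(country_sales)   # {'Canada': 250, 'USA': 320}
```

Drill-down is the reverse direction and requires keeping (or recomputing) the finer-grained city-level cells.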


 Slice and dice: The slice operation performs a selection on one
dimension of the given cube, resulting in a subcube.
 The dice operation defines a subcube by performing a selection
on two or more dimensions.
 Pivot (rotate): Pivot (also called rotate) is a visualization
operation that rotates the data axes in view to provide an
alternative data presentation.
 Other OLAP operations: Some OLAP systems offer additional
drilling operations. For example, drill-across executes queries
involving (i.e., across) more than one fact table. The
drill-through operation uses relational SQL facilities to drill
through the bottom level of a data cube down to its back-end
relational tables.
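Slice and dice can be sketched as selections over a tiny cube stored as a dict keyed by (time, item, location); the dimensions and cell values are hypothetical:

```python
# A tiny 3-D cube as {(time, item, location): sales}.
cube = {
    ("Q1", "phone", "Toronto"): 10, ("Q1", "phone", "Chicago"): 20,
    ("Q1", "laptop", "Toronto"): 5, ("Q2", "phone", "Toronto"): 7,
    ("Q2", "laptop", "Chicago"): 9,
}

def slice_cube(cube, time):
    """Slice: select on the single dimension time, yielding a 2-D subcube."""
    return {(i, l): v for (t, i, l), v in cube.items() if t == time}

def dice_cube(cube, times, items):
    """Dice: select on two dimensions at once, yielding a subcube."""
    return {k: v for k, v in cube.items() if k[0] in times and k[1] in items}

q1 = slice_cube(cube, "Q1")                      # the Q1 plane of the cube
sub = dice_cube(cube, {"Q1", "Q2"}, {"phone"})   # phone cells across quarters
print(len(q1), len(sub))
```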
Module-2

Data warehouse implementation & Data mining


 Efficient Data Cube computation: An overview
 Indexing OLAP Data: Bitmap index and join index
 Efficient processing of OLAP Queries
 OLAP server Architecture
 ROLAP versus MOLAP Versus HOLAP
 What is data mining, Challenges, Data Mining
Tasks
 Data: Types of Data, Data Quality, Data
Preprocessing
 Measures of Similarity and Dissimilarity



Efficient Data Cube Computation

 At the core of multidimensional data analysis is the efficient
computation of aggregations across many sets of dimensions
 In SQL terms, these aggregations are referred to as group-by’s
 Each group-by can be represented by a cuboid
 The set of group-by’s forms a lattice of cuboids defining a
data cube


The compute cube Operator and the Curse of Dimensionality

 SQL can be extended with a compute cube operator
 It aggregates over all subsets of the dimensions specified in
the operation
 It can require excessive storage space, especially for large
numbers of dimensions


efficient computation of data cubes.

 A data cube is a lattice of cuboids.


 create a data cube for AllElectronics sales that
contains the following: city, item, year, and sales
in dollars
 Compute the sum of sales, grouping by city and
item.
 Compute the sum of sales, grouping by city.
 Compute the sum of sales, grouping by item.

What is the total number of cuboids, or group-by’s,


that can be computed for this data cube?



 Taking the three attributes city, item, and year as the
dimensions for the data cube, and sales in dollars as the
measure,
 the total number of cuboids, or group-by’s, is 2^3 = 8:
 {(city, item, year), (city, item), (city, year), (item, year),
(city), (item), (year), ()}
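The eight group-by's can be enumerated mechanically as the subsets of the dimension set:

```python
# Enumerate all cuboids (group-by's) of the cube on (city, item, year):
# every subset of the dimensions, 2**3 = 8 in total.
from itertools import combinations

dims = ("city", "item", "year")
cuboids = [combo for r in range(len(dims) + 1)
           for combo in combinations(dims, r)]

print(len(cuboids))  # 8, from the apex cuboid () up to the base cuboid
```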


 The base cuboid contains all three dimensions, city, item, and
year. It can return the total sales for any combination of the
three dimensions
 The apex cuboid, or 0-D cuboid, refers to the case where the
group-by is empty. It contains the total sum of all sales
 The base cuboid is the least generalized (most specific) of the
cuboids. The apex cuboid is the most generalized (least specific)
of the cuboids
 No group-by: zero-dimensional operation (compute the sum of
total sales)
 One group-by: one-dimensional operation (compute the sum of
sales, group-by city)
 The cube can be defined as: define cube sales_cube [city, item,
year]: sum(sales_in_dollars). For a cube with n dimensions, there
are 2^n cuboids


 The storage requirements are even more excessive when many of
the dimensions have associated concept hierarchies, each with
multiple levels. This problem is referred to as the curse of
dimensionality
 Time is usually explored not at only one conceptual level
(e.g., year), but rather at multiple conceptual levels such as in
the hierarchy “day < month < quarter < year”
 For an n-dimensional data cube, the total number of cuboids is

    Total = (L1 + 1) × (L2 + 1) × … × (Ln + 1)

 where Li is the number of levels associated with dimension i.
One is added to Li to include the virtual top level
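The product formula can be evaluated directly. For example, with 4 hierarchy levels per dimension, a 10-dimensional cube already has 5^10 (about 9.8 million) cuboids; the numbers here are chosen only for illustration:

```python
# Total cuboids for an n-D cube with concept hierarchies:
# the product over dimensions of (L_i + 1), where L_i is the number of
# levels of dimension i and the +1 adds the virtual top level.
from math import prod

def total_cuboids(levels):
    return prod(l + 1 for l in levels)

# With a single level per dimension this reduces to 2^n:
assert total_cuboids([1, 1, 1]) == 8

print(total_cuboids([4] * 10))  # 5**10 = 9765625 cuboids
```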


Partial Materialization

 Partial materialization: selected computation of cuboids
 No materialization: Do not precompute any of the
“nonbase” cuboids. This leads to computing expensive
multidimensional aggregates on-the-fly, which can be
extremely slow
 Full materialization: Precompute all of the cuboids.
The resulting lattice of computed cuboids is referred to
as the full cube. This choice typically requires huge
amounts of memory space in order to store all of the
precomputed cuboids



 Partial materialization: Selectively compute a proper
subset of the whole set of possible cuboids.
Alternatively, we may compute a subset of the cube, which
contains only those cells that satisfy some user-specified
criterion, such as where the tuple count of each cell is
above some threshold. We will use the term subcube to
refer to the latter case, where only some of the cells may
be precomputed for various cuboids. Partial
materialization represents an interesting trade-off
between storage space and response time.



 Partial materialization of cuboids or subcubes should consider three factors:
– identify the subset of cuboids or subcubes to
materialize
– exploit the materialized cuboids or subcubes during
query processing
– efficiently update the materialized cuboids or
subcubes during load and refresh



 The selection of the subset of cuboids or subcubes to materialize should take into account the queries in the workload, their frequencies, and their accessing costs.
 It should also consider the cost for incremental updates, the total storage requirements, and the broader physical database design, such as the generation and selection of indices.
 We can compute an iceberg cube, which is a data cube that stores only those cube cells with an aggregate value (e.g., count) that is above some minimum support threshold.
 Another option is to materialize a shell cube. This involves precomputing the cuboids for only a small number of dimensions (e.g., three to five) of a data cube. Queries on additional combinations of the dimensions can be computed on-the-fly.
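A minimal sketch of the iceberg condition applied to a single cuboid, using count as the aggregate (the rows are toy data):

```python
from collections import Counter

# Toy fact rows projected onto (city, item); values are illustrative
rows = ([("Vancouver", "TV")] * 3
        + [("Toronto", "TV")] * 1
        + [("Toronto", "phone")] * 2)

def iceberg_cells(rows, min_support):
    """Keep only the cube cells whose aggregate (here, count) meets the
    minimum support threshold; all other cells are not materialized."""
    counts = Counter(rows)
    return {cell: n for cell, n in counts.items() if n >= min_support}

print(iceberg_cells(rows, min_support=2))
# {('Vancouver', 'TV'): 3, ('Toronto', 'phone'): 2}
```

The cell ('Toronto', 'TV') has count 1, below the threshold, so it is dropped; this is what lets an iceberg cube stay far smaller than the full cube on sparse data.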
 Once the selected cuboids have been materialized, it is
important to take advantage of them during query
processing. This involves several issues, such as how to
determine the relevant cuboid(s) from among the
candidate materialized cuboids, how to use available
index structures on the materialized cuboids, and how to
transform the OLAP operations onto the selected
cuboid(s)
 Finally, during load and refresh, the materialized cuboids
should be updated efficiently. Parallelism and incremental
update techniques for this operation should be explored.



Indexing OLAP Data: Bitmap Index and Join
Index
 The bitmap indexing method is popular in OLAP products
because it allows quick searching in data cubes
 The bitmap index is an alternative representation of the record ID
(RID) list
 In the bitmap index for a given attribute, there is a distinct bit vector,
Bv, for each value v in the attribute’s domain
 If the attribute’s domain consists of n values, then n bits are needed for each entry in the bitmap index
 If the attribute has the value v for a given row in the data table, then
the bit representing that value is set to 1 in the corresponding row of
the bitmap index. All other bits for that row are set to 0

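A minimal bitmap-index sketch in Python (attribute values are illustrative); note how an OR selection becomes pure bit arithmetic:

```python
def build_bitmap_index(values):
    """One bit vector per distinct attribute value; bit i is 1 iff row i
    holds that value, and 0 otherwise."""
    index = {v: [0] * len(values) for v in set(values)}
    for row, v in enumerate(values):
        index[v][row] = 1
    return index

# Toy attribute column with a low-cardinality domain
region = ["Asia", "Europe", "Asia", "America"]
idx = build_bitmap_index(region)

print(idx["Asia"])  # [1, 0, 1, 0]

# Selection "region = Asia OR region = Europe" reduces to a bitwise OR
combined = [a | b for a, b in zip(idx["Asia"], idx["Europe"])]
print(combined)  # [1, 1, 1, 0]
```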


 Bitmap indexing is useful for low-cardinality domains because comparison, join, and aggregation operations are then reduced to bit arithmetic, which substantially reduces the processing time.
 It leads to significant reductions in space and input/output since a string of characters can be represented by a single bit.
 For higher-cardinality domains, the method can be adapted using compression techniques.


Join indexing

 Join indexing registers the joinable rows of two relations from a relational database.
 If two relations R(RID, A) and S(B, SID) join on the attributes A and B, then the join index record contains the pair (RID, SID), where RID and SID are record identifiers from the R and S relations
 Join index records can identify joinable tuples without performing
costly join operations
 Join indexing is especially useful for maintaining the relationship
between a foreign key and its matching primary keys, from the
joinable relation
 Join indices may span multiple dimensions to form composite join
indices

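A sketch of building a join index over toy fact and dimension rows (names such as R1 and S1 are hypothetical record identifiers):

```python
# Toy rows: fact relation R(RID, A) and dimension relation S(B, SID),
# joining on the location attribute; all identifiers are illustrative
fact = [("R1", "Main St"), ("R2", "Main St"), ("R3", "Lake Ave")]
dim = [("Main St", "S1"), ("Lake Ave", "S2")]

def build_join_index(fact, dim):
    """Record the (RID, SID) pairs of joinable rows once, so later
    queries can find matches without re-running the join."""
    by_key = {}
    for key, sid in dim:
        by_key.setdefault(key, []).append(sid)
    return [(rid, sid)
            for rid, key in fact
            for sid in by_key.get(key, [])]

print(build_join_index(fact, dim))
# [('R1', 'S1'), ('R2', 'S1'), ('R3', 'S2')]
```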


Efficient Processing of OLAP Queries

 Determine which operations should be performed on the available cuboids: This involves transforming any selection, projection, roll-up (group-by), and drill-down operations specified in the query into corresponding SQL and/or OLAP operations.
 For example, slicing and dicing a data cube may correspond to selection and/or projection operations on a materialized cuboid.
 Determine to which materialized cuboid(s) the relevant operations should be applied: This involves identifying all of the materialized cuboids that may potentially be used to answer the query, pruning the set using knowledge of “dominance” relationships among the cuboids, estimating the costs of using the remaining materialized cuboids, and selecting the cuboid with the least cost.



 OLAP query processing. Suppose that we define a data cube for AllElectronics of the form “sales cube [time, item, location]: sum(sales in dollars).” The dimension hierarchies used are “day < month < quarter < year” for time; “item name < brand < type” for item; and “street < city < province_or_state < country” for location.
 Suppose that the query to be processed is on {brand, province_or_state}, with the selection constant “year = 2010.”
 cuboid 1: {year, item name, city}
 cuboid 2: {year, brand, country}
 cuboid 3: {year, brand, province or state}
 cuboid 4: {item name, province or state}, where year = 2010

“Which of these four cuboids should be selected to process the query?” Cuboid 2 cannot be used, because country is a more general concept than province_or_state; the choice among the remaining cuboids depends on their estimated processing costs.

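The pruning by dominance can be sketched as follows; here cuboid 4, which is pre-sliced at year = 2010, is modeled as having its time dimension at the year level (an assumption made for the sketch):

```python
# Hierarchies listed finest -> coarsest; a lower index means a finer level
HIER = {
    "time": ["day", "month", "quarter", "year"],
    "item": ["item_name", "brand", "type"],
    "location": ["street", "city", "province_or_state", "country"],
}

def can_answer(cuboid, query):
    """A cuboid can answer the query iff, on every query dimension, its
    level is at least as fine (lower or equal hierarchy index)."""
    return all(HIER[d].index(cuboid[d]) <= HIER[d].index(lvl)
               for d, lvl in query.items())

# Query on {brand, province_or_state} with selection year = 2010
query = {"time": "year", "item": "brand", "location": "province_or_state"}

cuboids = {
    1: {"time": "year", "item": "item_name", "location": "city"},
    2: {"time": "year", "item": "brand", "location": "country"},
    3: {"time": "year", "item": "brand", "location": "province_or_state"},
    # cuboid 4 is sliced at year = 2010; modeled here as time at year level
    4: {"time": "year", "item": "item_name", "location": "province_or_state"},
}

usable = [k for k, c in cuboids.items() if can_answer(c, query)]
print(usable)  # [1, 3, 4]: cuboid 2's country is coarser than province_or_state
```

Among the usable cuboids, the one with the least estimated processing cost would then be selected.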


OLAP Server Architectures
 Relational OLAP (ROLAP) servers: These are the
intermediate servers that stand in between a
relational back-end server and client front-end tools.
They use a relational or extended-relational DBMS to
store and manage warehouse data, and OLAP
middleware to support missing pieces. ROLAP servers
include optimization for each DBMS back end,
implementation of aggregation navigation logic, and
additional tools and services. ROLAP technology tends
to have greater scalability than MOLAP technology.



 Multidimensional OLAP (MOLAP) servers: These
servers support multidimensional data views
through array-based multidimensional storage engines.
They map multidimensional views directly to data cube
array structures. The advantage of using a data cube is
that it allows fast indexing to precomputed summarized
data. Notice that with multidimensional data stores, the
storage utilization may be low if the data set is sparse.



 Hybrid OLAP (HOLAP) servers: The hybrid OLAP
approach combines ROLAP and MOLAP technology,
benefiting from the greater scalability of ROLAP and
the faster computation of MOLAP. For example, a
HOLAP server may allow large volumes of detailed data
to be stored in a relational database, while aggregations
are kept in a separate MOLAP store. Microsoft SQL Server 2000, for example, supports a hybrid OLAP server.
