Module 1 - Data Warehousing & Modeling F.0
Suresh Y
Module – 1:
Data Warehousing & modeling: Basic Concepts: Data Warehousing: A multitier
Architecture, Data warehouse models: Enterprise warehouse, Data mart and virtual
warehouse, Extraction, Transformation and loading, Data Cube: A multidimensional data
model, Stars, Snowflakes and Fact constellations: Schemas for multidimensional Data
models, Dimensions: The role of concept Hierarchies, Measures: Their Categorization and
computation, Typical OLAP Operations.
Module – 2:
Data warehouse implementation & Data mining: Efficient Data Cube Computation: An
overview, Indexing OLAP Data: Bitmap index and join index, Efficient processing of OLAP
queries, OLAP server architecture: ROLAP versus MOLAP versus HOLAP. Introduction:
What is data mining, Challenges, Data Mining Tasks, Data: Types of Data, Data Quality,
Data Pre-processing, Measures of Similarity and Dissimilarity.
Module – 3:
Association Analysis: Association Analysis: Problem Definition, Frequent Item set
Generation, Rule generation. Alternative Methods for Generating Frequent Item sets, FP-
Growth Algorithm, Evaluation of Association Patterns.
Module – 4:
Classification: Decision Trees Induction, Method for Comparing Classifiers, Rule Based
Classifiers, Nearest Neighbor Classifiers, Bayesian Classifiers.
Module – 5:
Clustering Analysis: Overview, K-Means, Agglomerative Hierarchical Clustering,
DBSCAN, Cluster Evaluation, Density-Based Clustering, Graph-Based Clustering, Scalable
Clustering Algorithms.
Text Books:
1. Pang-Ning Tan, Michael Steinbach, Vipin Kumar: Introduction to Data Mining,
Pearson, First Impression, 2014.
2. Jiawei Han, Micheline Kamber, Jian Pei: Data Mining - Concepts and Techniques, 3rd
Edition, Morgan Kaufmann, 2012.
Reference Books:
1. Sam Anahory, Dennis Murray: Data Warehousing in the Real World, Pearson, Tenth
Impression, 2012.
2. Michael J. Berry, Gordon S. Linoff: Mastering Data Mining, Wiley, 2nd
Edition, 2012.
Figure 2 shows the typical framework for the construction and use of a data warehouse for
AllElectronics.
ETL
An ODS or a data warehouse is based on a single global schema that integrates and
consolidates enterprise information from many sources. Building such a system requires data
acquisition from OLTP and legacy systems. The ETL process involves extracting,
transforming and loading data from source systems. The process may sound very simple since
it only involves reading information from source databases, transforming it to fit the ODS
database model and loading it in the ODS.
As different data sources tend to have different conventions for coding information and
different standards for the quality of information, building an ODS requires data filtering, data
cleaning, and integration.
If an enterprise wishes to contact its customers or its suppliers, it is essential that a complete,
accurate and up-to-date list of contact addresses, email addresses and telephone numbers be
available. Correspondence sent to a wrong address that is then redirected does not create a
very good impression about the enterprise.
If a customer or supplier calls, the responding staff should be able to quickly find the person in
the enterprise database, but this requires that the caller's name or his/her company name is
accurately listed in the database.
If a customer appears in the databases with two or more slightly different names or different
account numbers, it becomes difficult to update the customer's information.
It has been suggested that data cleaning should be based on the following five steps:
1. Parsing: Parsing identifies various components of the source data files and then
establishes relationships between those and the fields in the target files. The classical
example of parsing is identifying the various components of a person's name and address.
2. Correcting: The individual components identified by parsing are corrected, for example
by checking them against secondary sources such as postal directories.
3. Standardizing: Business rules of the enterprise may now be used to transform the data to
standard form. For example, in some companies there might be rules on how name and
address are to be represented.
4. Matching: Much of the data extracted from a number of source systems is likely to be
related. Such data needs to be matched.
5. Consolidating: All corrected, standardized and matched data can now be consolidated to
build a single version of the enterprise data.
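The five steps above can be pictured as a small pipeline. The following Python sketch (not from the text) chains parsing, standardizing, matching, and consolidating for name-and-address records; the field names, rules, and sample records are illustrative assumptions only.

# Illustrative data-cleaning pipeline: parse -> standardize -> match -> consolidate.
# Field names and rules are assumptions for the example only.

raw_records = [
    {"name": "Dr. J. Smith", "city": "bengaluru "},
    {"name": "John Smith",   "city": "Bangalore"},
]

def parse(rec):
    # Parsing: split the free-form name into components.
    parts = rec["name"].replace("Dr.", "").split()
    return {"first": parts[0], "last": parts[-1], "city": rec["city"]}

def standardize(rec):
    # Standardizing: apply business rules (case, known synonyms).
    city = rec["city"].strip().title()
    city = {"Bangalore": "Bengaluru"}.get(city, city)
    return {**rec, "city": city}

def match_key(rec):
    # Matching: records with the same key are assumed to refer to one customer.
    return (rec["last"].lower(), rec["city"].lower())

def consolidate(records):
    # Consolidating: keep one merged record per matched key.
    merged = {}
    for rec in records:
        merged.setdefault(match_key(rec), rec)
    return list(merged.values())

clean = consolidate(standardize(parse(r)) for r in raw_records)
print(clean)   # one consolidated record for "J./John Smith, Bengaluru"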
2. Hardware integration: Once the hardware and software have been selected, they need to be
put together by integrating the servers, the storage devices and the client software tools.
3. Modelling: Modelling is a major step that involves designing the warehouse schema and
views. This may involve using a modelling tool if the data warehouse is complex.
4. Physical modelling: For the data warehouse to perform efficiently, physical modelling is
required. This involves designing the physical data warehouse organization, data placement,
data partitioning, deciding on access methods and indexing.
5. Sources: The data for the data warehouse is likely to come from a number of data sources.
This step involves identifying and connecting the sources using gateways, ODBC drivers or
other wrappers.
6. ETL: The data from the source systems will need to go through an ETL process. The step of
designing and implementing the ETL process may involve identifying a suitable ETL tool
vendor and purchasing and implementing the tool. This may include customizing the tool to
suit the needs of the enterprise.
7. Populate the data warehouse: Once the ETL tools have been agreed upon, testing the tools
will be required, perhaps using a staging area. Once everything is working satisfactorily, the
ETL tools may be used in populating the warehouse given the schema and view definitions.
8. User applications: For the data warehouse to be useful there must be end-user applications.
This step involves designing and implementing applications required by the end users.
9. Roll-out the warehouse and applications: Once the data warehouse has been populated and
the end-user applications tested, the warehouse system and the applications may be rolled out
for the user community to use.
Implementation Guidelines
2. Need a champion: A data warehouse project must have a champion who is willing to carry
out considerable research into expected costs and benefits of the project. Data warehousing
projects require inputs from many units in an enterprise and therefore need to be driven by
someone who is capable of interaction with people in the enterprise and can actively persuade
colleagues. Without the cooperation of other units, the data model for the warehouse and the
data required to populate the warehouse may be more complicated than they need to be.
Studies have shown that having a champion can help adoption and success of data
warehousing projects.
3. Senior management support: A data warehouse project must be fully supported by the senior
management. Given the resource intensive nature of such projects and the time they can take
to implement, a warehouse project calls for a sustained commitment from senior
management. This can sometimes be difficult since it may be hard to quantify the benefits of
data warehouse technology and the managers may consider it a cost without any explicit
return on investment. Data warehousing project studies show that top management support is
essential for the success of a data warehousing project.
4. Ensure quality: Only data that has been cleaned and is of a quality that is understood by the
organization should be loaded in the data warehouse. The data quality in the source systems is
not always high and often little effort is made to improve data quality in the source systems.
Improved data quality, when recognized by senior managers and stakeholders, is likely to lead
to improved support for a data warehouse project.
5. Corporate strategy: A data warehouse project must fit with corporate strategy and business
objectives. The objectives of the project must be clearly defined before the start of the project.
Given the importance of senior management support for a data warehousing project, the
fitness of the project with the corporate strategy is essential.
6. Business plan: The financial costs (hardware, software, people-ware), expected benefits and a
project plan (including an ETL plan) for a data warehouse project must be clearly outlined
and understood by all stakeholders. Without such understanding, rumours about expenditure
and benefits can become the only source of information, undermining the project.
7. Training: A data warehouse project must not overlook data warehouse training requirements.
For a data warehouse project to be successful, the users must be trained to use the warehouse
and to understand its capabilities. Training of users and professional development of the
project team may also be required since data warehousing is a complex task and the skills of
the project team are critical to the success of the project.
8. Adaptability: The project should build in adaptability so that changes may be made to the data
warehouse if and when required. Like any system, a data warehouse will need to change, as
needs of an enterprise change. Furthermore, once the data warehouse is operational, new
applications using the data warehouse are almost certain to be proposed. The system should
be able to support such new applications.
9. Joint management: The project must be managed by both IT and business professionals in
the enterprise. To ensure good communication with the stakeholders and that the project is
focused on assisting the enterprise's business, business professionals must be involved in the
project along with technical professionals.
1. Statistics
For example, in data mining tasks like data characterization and classification,
statistical models of target classes can be built. In other words, such statistical models can be
the outcome of a data mining task. Alternatively, data mining tasks can be built on top of
statistical models. For example, we can use statistics to model noise and missing data
values. Then, when mining patterns in a large data set, the data mining process can use the
model to help identify and handle noisy or missing values in the data.
Statistical methods can also be used to verify data mining results. For example, after a
classification or prediction model is mined, the model should be verified by statistical
hypothesis testing. A statistical hypothesis test (sometimes called confirmatory data analysis)
makes statistical decisions using experimental data. A result is called statistically significant
if it is unlikely to have occurred by chance. If the classification or prediction model holds
true, then the descriptive statistics of the model increases the soundness of the model.
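As an illustration of such verification, the following Python sketch applies a paired t-test (using scipy.stats, assumed available) to decide whether the difference between two models' cross-validation accuracies is statistically significant; the accuracy values are placeholders, not real results.

# A minimal sketch (not from the text) of verifying a mining result with a
# hypothesis test: are the fold accuracies of model A significantly higher
# than those of model B?  The accuracy values below are placeholders.
from scipy import stats

acc_model_a = [0.91, 0.88, 0.90, 0.93, 0.89]   # cross-validation accuracies (illustrative)
acc_model_b = [0.85, 0.84, 0.86, 0.88, 0.83]

t_stat, p_value = stats.ttest_rel(acc_model_a, acc_model_b)  # paired t-test
if p_value < 0.05:
    print(f"difference is statistically significant (p = {p_value:.4f})")
else:
    print(f"difference could plausibly be due to chance (p = {p_value:.4f})")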
2. Machine Learning
Machine learning investigates how computers can learn (or improve their performance) based
on data. A main research area is for computer programs to automatically learn to recognize
complex patterns and make intelligent decisions based on data. For example, a typical
machine learning problem is to program a computer so that it can automatically recognize
handwritten postal codes on mail after learning from a set of examples. Machine learning is a
fast-growing discipline.
2a. Supervised learning is basically a synonym for classification. The supervision in the
learning comes from the labeled examples in the training data set. For example, in the postal
code recognition problem, a set of handwritten postal code images and their corresponding
machine-readable translations are used as the training examples, which supervise the learning
of the classification model.
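A minimal sketch of supervised learning, assuming scikit-learn is available: labeled handwritten-digit images supervise the learning of a classifier, in the spirit of the postal-code example. The choice of a k-nearest-neighbour classifier is illustrative.

# A small supervised-learning sketch: labeled 8x8 digit images train a classifier.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)          # digit images and their labels
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))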
2b. Unsupervised learning is essentially a synonym for clustering. The learning process is
unsupervised since the input examples are not class labeled. Typically, we may use clustering
to discover classes within the data. For example, an unsupervised learning method can take,
as input, a set of images of handwritten digits. Suppose that it finds 10 clusters of data. These
clusters may correspond to the 10 distinct digits of 0 to 9, respectively. However, since the
training data are not labeled, the learned model cannot tell us the semantic meaning of the
clusters found.
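A corresponding unsupervised sketch, again assuming scikit-learn: the same digit images are clustered without their labels, so the 10 clusters found carry no semantic digit names.

# The digit images without their labels: a clustering sketch.  The algorithm
# finds 10 groups, but cannot tell us which group means which digit.
from sklearn.datasets import load_digits
from sklearn.cluster import KMeans

X, _ = load_digits(return_X_y=True)          # ignore the labels on purpose
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)
print("cluster sizes:", [int((kmeans.labels_ == k).sum()) for k in range(10)])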
2c. Semi-supervised learning is a class of machine learning techniques that make use of
both labeled and unlabeled examples when learning a model. In one approach, labelled
examples are used to learn class models and unlabeled examples are used to refine the
boundaries between classes. For a two-class problem, we can think of the set of examples
belonging to one class as the positive examples and those belonging to the other class as the
negative examples.
2d. Active learning is a machine learning approach that lets users play an active role in the
learning process. An active learning approach can ask a user (e.g., a domain expert) to label
an example, which may be from a set of unlabeled examples or synthesized by the learning
program. The goal is to optimize the model quality by actively acquiring knowledge from
human users, given a constraint on how many examples they can be asked to label.
Many data mining tasks need to handle large data sets or even real-time, fast streaming data.
Therefore, data mining can make good use of scalable database technologies to achieve high
efficiency and scalability on large data sets. Moreover, data mining tasks can be used to
extend the capability of existing database systems to satisfy advanced users’ sophisticated
data analysis requirements.
Recent database systems have built systematic data analysis capabilities on database data
using data warehousing and data mining facilities. A data warehouse integrates data
originating from multiple sources and various timeframes. It consolidates data in
multidimensional space to form partially materialized data cubes. The data cube model not
only facilitates OLAP in multidimensional databases but also promotes multidimensional
data mining.
The following section describes data mining functionalities, and the kinds of patterns they
can discover
- Or -
Data mining functionalities: characterization, discrimination, association and correlation
analysis, classification, prediction, clustering, and evolution analysis (With examples of each
data mining functionality, using a real-life database)
1. Concept/class description: characterization and discrimination
Data can be associated with classes or concepts. It describes a given set of data in a
concise and summarized manner, presenting interesting general properties of the data.
These descriptions can be derived via
1. data characterization, by summarizing the data of the class under study (often called
the target class)
2. data discrimination, by comparison of the target class with one or a set of comparative
classes
3. both data characterization and discrimination
a. Data characterization
It is a summarization of the general characteristics or features of a target class of data.
Example: A data mining system should be able to produce a description summarizing the
characteristics of a student who has obtained more than 75% in every semester; the result
could be a general profile of the student.
b. Data Discrimination
Is a comparison of the general features of target class data objects with the general
features of objects from one or a set of contrasting classes.
The general features of students with high GPAs may be compared with the general features
of students with low GPAs. The resulting description could be a general comparative profile
of the students, such as: 75% of the students with high GPAs are fourth-year computing
science students, while 65% of the students with low GPAs are not.
The output of data characterization can be presented in various forms. Examples include pie
charts, bar charts, curves, multidimensional data cubes, and multidimensional tables,
including crosstabs. The resulting descriptions can also be presented as generalized relations,
or in rule form called characteristic rules.
Typical Applications
- credit approval
- target marketing
- medical diagnosis
- treatment effectiveness analysis
Classification can be defined as the process of finding a model (or function) that describes
and distinguishes data classes or concepts, for the purpose of being able to use the model to
predict the class of objects whose class label is unknown. The derived model is based on the
analysis of a set of training data (i.e., data objects whose class label is known).
Example:
An airport security screening station is used to determine if passengers are potential
terrorists or criminals. To do this, the face of each passenger is scanned and its basic
pattern (distance between eyes, size and shape of mouth, head, etc.) is identified. This
pattern is compared to entries in a database to see if it matches any patterns that are
associated with known offenders.
A classification model can be represented in various forms, such as
1) IF-THEN rules,
student(class, "undergraduate") AND concentration(level, "high") ==> class A
student(class, "undergraduate") AND concentration(level, "low") ==> class B
student(class, "postgraduate") ==> class C
2) Decision tree
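As an illustration of the decision-tree form, the sketch below (assuming scikit-learn) learns a tiny tree that reproduces rules like those above; the four-row training table and the feature encoding are invented for the example.

# A sketch of learning a decision tree that mirrors the IF-THEN rules above.
from sklearn.tree import DecisionTreeClassifier, export_text

# features: [is_undergraduate (1/0), concentration_high (1/0)]
X = [[1, 1], [1, 0], [0, 1], [0, 0]]
y = ["A", "B", "C", "C"]                  # class labels as in the rules

tree = DecisionTreeClassifier().fit(X, y)
print(export_text(tree, feature_names=["is_undergraduate", "concentration_high"]))
print(tree.predict([[1, 1]]))             # -> ['A']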
Prediction:
Finding missing or unavailable numerical data values, rather than class labels, is referred to as
prediction. Although prediction may refer to both data value prediction and class label
prediction, it is usually confined to data value prediction and is thus distinct from
classification. Prediction also encompasses the identification of distribution trends based on
the available data.
Example:
Predicting flooding is a difficult problem. One approach uses monitors placed at various
points in the river. These monitors collect data relevant to flood prediction: water level, rainfall
amount, time, humidity, and so on. The water level at a potential flooding point in the river can
then be predicted based on the data collected by the sensors upriver from this point. The
prediction must be made with respect to the time the data were collected.
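A minimal numeric-prediction sketch of this idea in Python: fit a straight line relating an upstream sensor reading to the later downstream water level and use it to predict a new value. The readings are invented illustrative numbers, not real flood data.

# Predict a downstream water level from an upstream reading with a fitted line.
import numpy as np

upstream_level = np.array([1.2, 1.5, 1.9, 2.4, 3.0])     # metres, hours earlier (illustrative)
downstream_level = np.array([1.0, 1.3, 1.8, 2.2, 2.9])   # metres, observed later

slope, intercept = np.polyfit(upstream_level, downstream_level, deg=1)
predicted = slope * 2.7 + intercept      # predict for a new upstream reading
print(f"predicted downstream level: {predicted:.2f} m")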
Example:
A certain national department store chain creates special catalogs targeted to various
demographic groups based on attributes such as income, location and physical characteristics
of potential customers (age, height, weight, etc). To determine the target mailings of the
various catalogs and to assist in the creation of new, more specific catalogs, the company
performs a clustering of potential customers based on the determined attribute values. The
results of the clustering exercise are then used by management to create special catalogs and
distribute them to the correct target population based on the cluster for that catalog.
Dimensionality
In some application domains, the number of dimensions (or attributes of a record) can be
very large, which makes the data difficult to analyze because of the 'curse of
dimensionality'.
For example, in bioinformatics, the development of advanced microarray technologies
allows us to analyze gene expression data with thousands of attributes. The
dimensionality of a data mining problem may also increase substantially due to the
temporal, spatial, and sequential nature of the data.
Complex Data
Traditional statistical methods often deal with simple data types such as continuous and
categorical attributes. However, in recent years, more complicated types of structured and
semi-structured data have become more important. One example of such data is graph-
based data representing the linkages of web pages, social networks, or chemical
structures. Another example is the free-form text that is found on most web pages.
Traditional data analysis techniques often need to be modified to handle the complex
nature of such data.
Data Quality
Many data sets have one or more problems with data quality, e.g.,some values may be
erroneous or inexact, or there may be missing values. As a result, even if a ‘perfect’ data
mining algorithm is used to analyze the data, the information discovered may still be
incorrect. Hence, there is a need for data mining techniques that can perform well when
the data quality is less than perfect.
Data warehousing:
A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile
collection of data in support of management’s decision making process
Integrated:
A data warehouse is usually constructed by integrating multiple heterogeneous
sources, such as relational databases, flat files, and online transaction records.
Data cleaning and data integration techniques are applied to ensure consistency in
naming conventions, encoding structures, attribute measures, and so on.
Time-variant:
Data are stored to provide information from a historical perspective (e.g., the past 5–10 years).
Every key structure in the data warehouse contains, either implicitly or explicitly, a
time element.
Non-volatile:
A data warehouse is always a physically separate store of data transformed from the
application data found in the operational environment.
A data warehouse does not require transaction processing, recovery, and concurrency
control mechanisms.
It usually requires only two operations in data accessing: initial loading of data and
access of data.
The middle tier is an OLAP server that is typically implemented using either
i. Relational OLAP(ROLAP) model (i.e., an extended relational DBMS that maps
operations on multidimensional data to standard relational operations); or
ii. A multidimensional OLAP (MOLAP) model (i.e., a special-purpose server that
directly implements multidimensional data and operations).
The top tier is a front-end client layer, which contains query and reporting tools, analysis
tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).
Metadata Repository
Metadata are data about data.
Metadata are the data that define warehouse objects.
They are created for the data names and definitions of the given warehouse.
Additional metadata are created and captured for timestamping any extracted data, the
source of the extracted data, and missing fields that have been added by data cleaning or
integration processes.
A metadata repository should contain the following:
i. Description of the data warehouse structure, which includes the warehouse
schema, view, dimensions, hierarchies, and derived data definitions, as well as data
mart locations and contents.
ii. Operational metadata, which include data lineage (history of migrated data & the
sequence of transformations applied to it), currency of data (active, archived, or
purged), and monitoring information (warehouse usage statistics, error reports, &
audit trails).
iii. Algorithms used for summarization, which include measure and dimension
definition algorithms, data on granularity, partitions, subject areas, aggregation,
summarization, and predefined queries and reports.
iv. Mapping from the operational environment to the data warehouse, which includes
source databases and their contents, gateway descriptions, data partitions, data
extraction, cleaning, transformation rules and defaults, data refresh and purging rules,
and security (user authorization and access control).
v. Data related to system performance, which include indices and profiles that
improve data access and retrieval performance, in addition to rules for the timing &
scheduling of refresh, update, and replication cycles.
vi. Business metadata, which include business terms and definitions, data ownership
information, and charging policies.
ii. The data contained in data marts tend to be summarized. Data marts are usually
implemented on low-cost departmental servers that are Unix / Linux or Windows
based.
iii. The implementation cycle of a data mart is more likely to be measured in weeks
rather than months or years. However, it may involve complex integration in the long
run if its design and planning were not enterprise-wide.
iv. Depending on the source of data, data marts can be categorized as independent or
dependent.
a. Independent data marts are sourced from data captured from one or more
operational systems or external information providers, or from data generated
locally within a particular department or geographic area.
b. Dependent data marts are sourced directly from enterprise data warehouses.
3. Virtual warehouse:
i. A virtual warehouse is a set of views over operational databases.
ii. For efficient query processing, only some of the possible summary views may be
materialized.
iii. A virtual warehouse is easy to build but requires excess capacity on operational
database servers.
First, a high-level corporate data model is defined within a reasonably short period (such
as one or two months) that provides a corporate-wide, consistent, integrated view of data
among different subjects and potential usages. This high-level model, although it will need to
be refined in the further development of enterprise data warehouses and departmental data
marts, will greatly reduce future integration problems.
Second, independent data marts can be implemented in parallel with the enterprise
warehouse based on the same corporate data model set noted before.
Third, distributed data marts can be constructed to integrate different data marts via hub
servers.
Finally, a multitier data warehouse is constructed where the enterprise warehouse is the
sole custodian of all warehouse data, which is then distributed to the various dependent data
marts.
Extraction, Transformation, and Loading (ETL)
• The ETL process involves extracting, transforming and loading data from multiple source
systems.
• In practice, however, the process is much more complex and tedious than it may appear, and
it may require significant resources to implement.
• Different data-sources tend to have
→ different conventions for coding information &
→ different standards for the quality of information
• Building an ODS requires data filtering, data cleaning and integration.
• Data-errors at least partly arise because of unmotivated data-entry staff.
ETL FUNCTIONS
• The ETL process consists of
→ data extraction from source systems
→ data transformation which includes data cleaning and
→ data loading in the ODS or the data warehouse
• Data cleaning deals with detecting & removing errors/inconsistencies from the data, in
particular the data that is sourced from a variety of computer systems.
• Building an integrated database from a number of source-systems may involve solving
some of the following problems:
2. Data Errors
• Following are different types of data errors
→ data may have some missing attribute values
→ there may be duplicate records
→ there may be wrong aggregations
→ there may be inconsistent use of nulls, spaces and empty values
→ some attribute values may be inconsistent (i.e., outside their domain)
→ there may be non-unique identifiers
3. Record Linkage Problem
This deals with the problem of linking information from different databases that
relates to the same customer.
Record linkage can involve a large number of record comparisons to ensure
linkages that have a high level of accuracy.
Data warehouse systems use back-end tools and utilities to populate and refresh their data.
These tools and utilities include the following functions:
1. Data extraction, which typically gathers data from multiple, heterogeneous, and
external sources.
2. Data cleaning, which detects errors in the data and rectifies them when possible.
3. Data transformation, which converts data from legacy or host format to
warehouse format.
4. Load, which sorts, summarizes, consolidates, computes views, checks integrity, and
builds indices and partitions.
5. Refresh, which propagates the updates from the data sources to the warehouse.
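A compact sketch of these functions in Python with pandas (assumed available): extract rows from a source, clean and transform them, and load them into a warehouse table. The file name, column names, and sample rows are assumptions for the example.

# A minimal ETL sketch: extract from a source table, transform/clean it, load it.
import pandas as pd

# Extract: read data from an operational source (hard-coded here for illustration).
src = pd.DataFrame({"cust_id": [1, 1, 2], "city": ["blr", "blr", None],
                    "amount": ["100", "100", "250"]})

# Transform: drop duplicates, standardize codes, fix types, handle missing values.
clean = (src.drop_duplicates()
            .assign(city=lambda d: d["city"].fillna("unknown").str.upper(),
                    amount=lambda d: d["amount"].astype(float)))

# Load: append the cleaned rows to the warehouse (here just a CSV stand-in).
clean.to_csv("warehouse_sales.csv", mode="a", index=False, header=False)
print(clean)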
Comparison of OLTP and OLAP systems:
1. User & system orientation. OLTP: customer oriented; used for query and transaction processing by clerks. OLAP: market oriented; used for data analysis by knowledge workers (managers, analysts).
2. Database. OLTP: current data. OLAP: large amounts of historical data.
3. Database design. OLTP: adopts the ER model. OLAP: star and snowflake schemas.
4. Access patterns. OLTP: short atomic transactions; requires concurrency control and provides recovery mechanisms. OLAP: mostly read-only operations.
5. Size. OLTP: GB and higher. OLAP: ≥ TB.
6. Metric. OLTP: transaction throughput. OLAP: query throughput, response time.
7. Update of records. OLTP: continuous. OLAP: batch mode.
8. Query & view. OLTP: simple (detailed, flat relational). OLAP: complex (summarized, multidimensional).
9. Focus. OLTP: data-in. OLAP: information-out.
10. Number of users. OLTP: thousands. OLAP: hundreds.
11. Number of records accessed. OLTP: tens. OLAP: millions.
12. Priority. OLTP: high performance, high availability. OLAP: high flexibility, end-user autonomy.
13. Operations. OLTP: index/hash on primary key. OLAP: lots of scans.
Dimensions are the perspectives or entities with respect to which an organization wants to
keep records.
For example, AllElectronics may create a sales data warehouse in order to keep
records of the store’s sales with respect to the dimensions time, item, branch, and location.
Each dimension may have a table associated with it, called a dimension table, which further
describes the dimension.
For example, a dimension table for item may contain the attributes item name, brand, and
type. Dimension tables can be specified by users or experts, or automatically generated and
adjusted based on data distributions.
A multidimensional data model is typically organized around a central theme, such as
sales. This theme is represented by a fact table. Facts are numeric measures. The fact table
contains the names of the facts, or measures, as well as keys to each of the related dimension
tables.
Although we usually think of cubes as 3-D geometric structures, in data warehousing
the data cube is n-dimensional. To gain a better understanding of data cubes and the
multidimensional data model, let’s start by looking at a simple 2-D data cube that is, in fact, a
table or spreadsheet for sales data from AllElectronics. In particular, we will look at the
AllElectronics sales data for items sold per quarter in the city of Vancouver. These data are
shown in Table 1. In this 2-D representation, the sales for Vancouver are shown with respect
to the time dimension (organized in quarters) and the item dimension (organized according to
the types of items sold). The fact or measure displayed is dollars sold (in thousands).
Table1: 2D View of Sales Data for AllElectronics according to time and item.
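The 2-D view in Table 1 can be reproduced as a pivot table. The sketch below, assuming pandas, aggregates a few illustrative sales rows by the time and item dimensions.

# A 2-D "cube": sales facts aggregated by the time and item dimensions.
import pandas as pd

sales = pd.DataFrame({
    "quarter": ["Q1", "Q1", "Q2", "Q2"],
    "item":    ["home entertainment", "computer", "home entertainment", "computer"],
    "dollars_sold": [605, 825, 680, 952],   # illustrative values, in thousands
})

cube_2d = pd.pivot_table(sales, values="dollars_sold",
                         index="quarter", columns="item", aggfunc="sum")
print(cube_2d)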
Now, suppose that we would like to view the sales data with a third dimension. For instance,
suppose we would like to view the data according to time and item, as well as location, for
the cities Chicago, New York, Toronto, and Vancouver. These 3-D data are shown in Table
2. The 3-D data in the table are represented as a series of 2-D tables. Conceptually, we may
also represent the same data in the form of a 3-D data cube, as in Figure 2.
Table-2: 3D View of Sales Data for AllElectronics according to time, item & locations.
Tables 1 and 2 show the data at different degrees of summarization. In the data
warehousing research literature, a data cube like those shown in Figures 7 and Figure 8 is
often referred to as a cuboid.
Figure 7: 3-D data cube representation of the data in Table 2, according to time, item, and location.
Figure 8: 4-D data cube representation of sales data, according to time, item, location, and
supplier.
The cuboid that holds the lowest level of summarization is called the base cuboid. For
example, the 4-D cuboid in Figure-8 is the base cuboid for the given time, item, location, and
supplier dimensions. Figure-7 is a 3-D (non-base) cuboid for time, item, and location,
summarized for all suppliers. The 0-D cuboid, which holds the highest level of
summarization, is called the apex cuboid. In our example, this is the total sales, or dollars
sold, summarized over all four dimensions. The apex cuboid is typically denoted by all.
Figure-9: Lattice of cuboids, making up a 4-D data cube for time, item, location, and supplier.
Each cuboid represents a different degree of summarization.
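The lattice idea can be made concrete with a few lines of Python: every subset of the dimension set defines one cuboid, so n dimensions (ignoring concept hierarchies) yield 2^n cuboids, 16 in this four-dimensional example.

# Enumerate the cuboids of the 4-D cube: one cuboid per subset of the dimensions.
from itertools import combinations

dims = ["time", "item", "location", "supplier"]
cuboids = [combo for r in range(len(dims) + 1) for combo in combinations(dims, r)]

print(len(cuboids))        # 16 cuboids for 4 dimensions
print(cuboids[0])          # () -> the apex cuboid (summarized over everything)
print(cuboids[-1])         # the 4-D base cuboid (time, item, location, supplier)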
OLAP
• OLAP stands for Online Analytical Processing.
• This is primarily a software-technology concerned with fast analysis of enterprise
information.
• In other words, OLAP is the dynamic enterprise analysis required to create, manipulate,
animate & synthesize information from exegetical, contemplative and formulaic data
analysis models.
• Business-Intelligence(BI) is used to mean both data-warehousing and OLAP.
• In other words, BI is defined as a user-centered process of
→ exploring data, data-relationships and trends
→ thereby helping to improve overall decision-making.
Analytic
• The system must provide rich analytic-functionality.
• Most queries should be answered without any programming.
• The system should be able to cope with any relevant queries for application & user.
Shared
• The system is
→ likely to be accessed by only a few business analysts, and
→ may be used by thousands of users
• Being a shared system, the OLAP software should provide adequate security for
confidentiality & integrity.
• Concurrency-control is obviously required if users are writing or updating data in
the database.
Multidimensional
Information
Schemas for
Multidimensional Data Models: Stars, Snowflakes, & Fact Constellations:
The entity-relationship data model is commonly used in the design of relational databases,
where a database schema consists of a set of entities and the relationships between them.
Such a data model is appropriate for online transaction processing.
A data warehouse, however, requires a concise, subject-oriented schema that facilitates online
data analysis. The most popular data model for a data warehouse is a multidimensional
model, which can exist in the form of a star schema, a snowflake schema, or a fact
constellation schema.
1. Star schema:
The most common modeling paradigm is the star schema, in which the data warehouse
contains
i. A large central table (fact table) containing the bulk of the data, with no
redundancy, and
ii. A set of smaller attendant tables (dimension tables), one for each dimension.
The schema graph resembles a starburst, with the dimension tables displayed in a radial
pattern around the central fact table.
Example: A star schema for AllElectronics sales is shown in Figure 10. Sales are considered
along four dimensions: time, item, branch, and location. The schema contains a central fact
table for sales that contains keys to each of the four dimensions, along with two measures:
dollars sold and units sold. To minimize the size of the fact table, dimension identifiers (e.g.,
time key and item key) are system-generated identifiers.
Notice that in the star schema, each dimension is represented by only one table, and each
table contains a set of attributes. For example, the location dimension table contains the
attribute set {location_key, street, city, province or state, country}. This constraint may
introduce some redundancy.
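A small sketch of the star join this schema implies, using pandas (assumed available): the sales fact table carries surrogate keys that are joined to one dimension table per key. The rows and some column names are illustrative.

# Star join: the fact table joined to its dimension tables on the surrogate keys.
import pandas as pd

sales_fact = pd.DataFrame({"time_key": [1, 2], "item_key": [10, 11],
                           "dollars_sold": [605.0, 825.0], "units_sold": [50, 61]})
time_dim = pd.DataFrame({"time_key": [1, 2], "quarter": ["Q1", "Q2"], "year": [2012, 2012]})
item_dim = pd.DataFrame({"item_key": [10, 11],
                         "item_name": ["TV", "PC"], "type": ["home entertainment", "computer"]})

star = (sales_fact.merge(time_dim, on="time_key")
                  .merge(item_dim, on="item_key"))
print(star[["quarter", "item_name", "dollars_sold", "units_sold"]])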
2. Snowflake Schema:
The snowflake schema is a variant of the star schema model, where some dimension tables
are normalized, thereby further splitting the data into additional tables. The resulting schema
graph forms a shape similar to a snowflake.
i. The major difference between the snowflake and star schema models is that the
dimension tables of the snowflake model may be kept in normalized form to
reduce redundancies.
ii. Such a table is easy to maintain and saves storage space. However, this space
savings is negligible in comparison to the typical magnitude of the fact table.
iii. Furthermore, the snowflake structure can reduce the effectiveness of browsing,
since more joins will be needed to execute a query. Consequently, the system
performance may be adversely impacted. Hence, although the snowflake schema
reduces redundancy, it is not as popular as the star schema in data warehouse
design.
Fact constellation:
Sophisticated applications may require multiple fact tables to share dimension tables.
This kind of schema can be viewed as a collection of stars, and hence is called a
galaxy schema or a fact constellation.
Example: Fact constellation. A fact constellation schema is shown in Figure 12. This schema
specifies two fact tables, sales and shipping. The sales table definition is identical to that of
the star schema (Figure-10). The shipping table has five dimensions, or keys—item key, time
key, shipper key, from location, and to location—and two measures—dollars cost and units
shipped. A fact constellation schema allows dimension tables to be shared between fact
tables. For example, the dimension tables for time, item, and location are shared between the
sales and shipping fact tables.
Figure 13: A concept hierarchy for location. Due to space limitations, not all of the hierarchy
nodes are shown (ellipses indicate omitted nodes).
Figure 13: Hierarchical and lattice structures of attributes in warehouse dimensions: (a) a
hierarchy for location and (b) a lattice for time.
1. Distributive:
i. An aggregate function is distributive if it can be computed in a distributed manner:
partition the data, apply the function to each partition, and then apply it again to the
partial results to obtain the overall result. sum() can be computed in this way; for the
same reason, count(), min(), and max() are distributive aggregate functions.
A measure is distributive if it is obtained by applying a distributive aggregate function.
Distributive measures can be computed efficiently because of the way the computation
can be partitioned.
2. Algebraic:
i. An aggregate function is algebraic if it can be computed by an algebraic function
with M arguments (where M is a bounded positive integer), each of which is obtained by
applying a distributive aggregate function.
For example, avg() (average) can be computed by sum()/count(), where both sum() and
count() are distributive aggregate functions.
Similarly, it can be shown that min N() and max N() (which find the N minimum and N
maximum values, respectively, in a given set) and standard deviation() are algebraic
aggregate functions.
A measure is algebraic if it is obtained by applying an algebraic aggregate function.
3. Holistic:
i. An aggregate function is holistic if there is no constant bound on the storage size
needed to describe a sub-aggregate. That is, there does not exist an algebraic
function with M arguments (where M is a constant) that characterizes the
computation.
Common examples of holistic functions include median(), mode(), and rank().
A measure is holistic if it is obtained by applying a holistic aggregate function.
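The following Python sketch illustrates the difference: distributive measures (sum, count) and the algebraic measure avg can be assembled from per-partition sub-aggregates, while a holistic measure such as the median needs all of the raw values.

# Why distributive and algebraic measures are cheap: sub-aggregates suffice.
partitions = [[4, 8, 15], [16, 23], [42]]      # data split across three partitions

# Distributive: sum() and count() of the whole set come from the partial results.
partial_sums   = [sum(p) for p in partitions]
partial_counts = [len(p) for p in partitions]
total_sum, total_count = sum(partial_sums), sum(partial_counts)

# Algebraic: avg() is computed from a bounded number (M = 2) of distributive results.
average = total_sum / total_count
print(total_sum, total_count, average)          # 108 6 18.0

# Holistic: median() cannot be derived from such partials; it needs all the values.
import statistics
print(statistics.median(v for p in partitions for v in p))   # 15.5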
In the multidimensional model, data are organized into multiple dimensions, and each
dimension contains multiple levels of abstraction defined by concept hierarchies. This
organization provides users with the flexibility to view data from different perspectives. A
number of OLAP data cube operations exist to materialize these different views, allowing
interactive querying and analysis of the data at hand. Hence, OLAP provides a user-friendly
environment for interactive data analysis.
OLAP operations: Each of the following operations is illustrated in Figure 14. At
the center of the figure is a data cube for AllElectronics sales. The cube contains the
dimensions location, time, and item, where location is aggregated with respect to city values,
time is aggregated with respect to quarters, and item is aggregated with respect to item types.
For better understanding, we refer to this cube as the central cube. The measure displayed is
dollars sold (in thousands). (For improved readability, only some of the cubes’ cell values are
shown.) The data examined are for the cities Chicago, New York, Toronto, and Vancouver.
1. Roll-up:
i. The roll-up operation (also called the drill-up operation by some vendors) performs
aggregation on a data cube, either by climbing up a concept hierarchy for a dimension
or by dimension reduction.
ii. Figure 14 shows the result of a roll-up operation performed on the central cube by
climbing up the concept hierarchy for location given in Figure 13.
This hierarchy was defined as the total order “street < city < province or state <
country.”
iii. The roll-up operation shown aggregates the data by ascending the location hierarchy
from the level of city to the level of country.
a. In other words, rather than grouping the data by city, the resulting cube groups
the data by country.
iv. When roll-up is performed by dimension reduction, one or more dimensions are
removed from the given cube.
2. Drill-down:
i. Drill-down is the reverse of roll-up. It navigates from less detailed data to more
detailed data.
ii. Drill-down can be realized by either stepping down a concept hierarchy for a
dimension or introducing additional dimensions.
iii. Drill-down occurs by descending the time hierarchy from the level of quarter to the
more detailed level of month.
iv. The resulting data cube details the total sales per month rather than summarizing them
by quarter.
v. Because a drill-down adds more detail to the given data, it can also be performed by
adding new dimensions to a cube.
4. Pivot (rotate):
Pivot (also called rotate) is a visualization operation that rotates the data axes in view to
provide an alternative data presentation. Figure 14 shows a pivot operation where the item
and location axes in a 2-D slice are rotated. Other examples include rotating the axes in a 3-D
cube, or transforming a 3-D cube into a series of 2-D planes.
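These operations can be mimicked on a small fact table with pandas (assumed available); the concept hierarchy (city to country) and the rows are illustrative.

# Roll-up, drill-down, and pivot sketched as group-by and pivot operations.
import pandas as pd

facts = pd.DataFrame({
    "city":    ["Vancouver", "Toronto", "Chicago", "New York"],
    "country": ["Canada", "Canada", "USA", "USA"],
    "quarter": ["Q1", "Q1", "Q1", "Q1"],
    "month":   ["Jan", "Feb", "Jan", "Mar"],
    "item":    ["computer", "computer", "phone", "phone"],
    "dollars_sold": [825, 968, 746, 1087],
})

# Roll-up on location: climb the hierarchy from city to country.
rollup = facts.groupby(["country", "quarter", "item"])["dollars_sold"].sum()

# Drill-down on time: descend from quarter to the more detailed month level.
drilldown = facts.groupby(["city", "month", "item"])["dollars_sold"].sum()

# Pivot: rotate the axes of a 2-D view (items as rows, cities as columns).
pivoted = facts.pivot_table(values="dollars_sold", index="item",
                            columns="city", aggfunc="sum")
print(rollup, drilldown, pivoted, sep="\n\n")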
1. Task-relevant data: This primitive specifies the data upon which mining is to be
performed. It involves specifying the database and tables or data warehouse containing
the relevant data, conditions for selecting the relevant data, the relevant attributes or
dimensions for exploration, and instructions regarding the ordering or grouping of the
data retrieved.
2. Knowledge type to be mined: This primitive specifies the specific data mining function
to be performed, such as characterization, discrimination, association, classification,
clustering, or evolution analysis. As well, the user can be more specific and provide
pattern templates that all discovered patterns must match. These templates or meta
patterns (also called meta rules or meta queries), can be used to guide the discovery
process.
3. Background knowledge: This primitive allows users to specify knowledge they have
about the domain to be mined. Such knowledge can be used to guide the knowledge
discovery process and evaluate the patterns that are found. Of the several kinds of
background knowledge, this chapter focuses on concept hierarchies.
4. Pattern interestingness measure: This primitive allows users to specify functions that
are used to separate uninteresting patterns from knowledge and may be used to guide the
mining process, as well as to evaluate the discovered patterns. This allows the user to
confine the number of uninteresting patterns returned by the process, as a data mining
process may generate a large number of patterns. Interestingness measures can be
specified for such pattern characteristics as simplicity, certainty, utility and novelty.