DATA Science Unit - II Part 1 (BCA)

Unit-II

Data Warehousing and Online Analytical Processing: Basic Concepts
What Is a Data Warehouse?
Loosely speaking, a data warehouse refers to a data repository that is maintained separately from an
organization’s operational databases.
According to William H. Inmon, a leading architect in the construction of data warehouse systems, “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision making process.”
Subject-oriented: A data warehouse is organized around major subjects such as customer, supplier, product, and sales. Rather than concentrating on the day-to-day operations and transaction processing of an organization, a data warehouse focuses on the modeling and analysis of data for decision makers.
Integrated: A data warehouse is usually constructed by integrating multiple heterogeneous sources,
such as relational databases, flat files, and online transaction records. Data cleaning and data
integration techniques are applied to ensure consistency in naming conventions, encoding structures,
attribute measures, and so on.
Time-variant: Data are stored to provide information from a historical perspective (e.g., the past 5–10 years). Every key structure in the data warehouse contains, either implicitly or explicitly, a time element.
Nonvolatile: A data warehouse is always a physically separate store of data transformed from the application data found in the operational environment. Due to this separation, a data warehouse does not require transaction processing, recovery, and concurrency control mechanisms. It usually requires only two operations in data accessing: initial loading of data and access of data.
We refer to data warehousing as the process of constructing and using data warehouses. The construction of a data warehouse requires data cleaning, data integration, and data consolidation.
“How are organizations using the information from data warehouses?” Many organizations use this
information to support business decision-making activities, including (1) increasing customer focus,
which includes the analysis of customer buying patterns (such as buying preference, buying time,
budget cycles, and appetites for
spending); (2) repositioning products and managing product portfolios by comparing the performance
of sales by quarter, by year, and by geographic regions in order to fine-tune production strategies; (3)
analyzing operations and looking for sources of profit; and (4) managing customer relationships,
making environmental corrections, and managing the cost of corporate assets.
The traditional database approach to heterogeneous database integration is to build wrappers and
integrators (or mediators) on top of multiple, heterogeneous databases. When a query is posed to a
client site, a metadata dictionary is used to translate the query into queries appropriate for the
individual heterogeneous sites involved. These queries are then mapped and sent to local query
processors. The results returned from the different sites are integrated into a global answer set. This
query-driven approach requires complex information filtering and integration processes, and
competes with local sites for processing resources. It is inefficient and potentially expensive for
frequent queries, especially queries requiring aggregations.
Data warehousing provides an interesting alternative to this traditional approach. Rather than using a
query-driven approach, data warehousing employs an update-driven approach in which information
from multiple, heterogeneous sources is integrated in advance and stored in a warehouse for direct
querying and analysis.

Differences between Operational Database Systems and Data Warehouses

The major task of online operational database systems is to perform online transaction and query
processing. These systems are called online transaction processing (OLTP) systems. They cover
most of the day-to-day operations of an organization such as purchasing, inventory, manufacturing,
banking, payroll, registration, and accounting. Data warehouse systems, on the other hand, serve users or knowledge workers in the role of data analysis and decision making. Such systems can organize and
present data in various formats in order to accommodate the diverse needs of different users. These
systems are known as online analytical processing (OLAP) systems.
The major distinguishing features of OLTP and OLAP are summarized as follows:

Users and system orientation: An OLTP system is customer-oriented and is used for transaction and
query processing by clerks, clients, and information technology professionals. An OLAP system is
market-oriented and is used for data analysis by knowledge workers, including managers, executives,
and analysts.

Data contents: An OLTP system manages current data that, typically, are too detailed to be easily used for decision making. An OLAP system manages large amounts of historic data, provides facilities for summarization and aggregation, and stores and manages information at different levels of granularity.

Database design: An OLTP system usually adopts an entity-relationship (ER) data model and an
application-oriented database design. An OLAP system typically adopts either a star or a snowflake
model and a subject-oriented database design.

View: An OLTP system focuses mainly on the current data within an enterprise or department, without referring to historic data or data in different organizations. In contrast, an OLAP system often spans multiple versions of a database schema, due to the evolutionary process of an organization.

Access patterns: The access patterns of an OLTP system consist mainly of short, atomic transactions. Such a system requires concurrency control and recovery mechanisms. However, accesses to OLAP systems are mostly read-only operations (because most data warehouses store historic rather than up-to-date information), although many could be complex queries.

But, Why Have a Separate Data Warehouse?
“Why not perform online analytical processing directly on such databases instead of spending
additional time and resources to construct a separate data warehouse?”
A major reason for such a separation is to help promote the high performance of both systems. An
operational database is designed and tuned from known tasks and workloads like indexing and hashing using primary keys, searching for particular records, and optimizing “canned” queries. On the other hand, data warehouse queries are often complex. They involve the computation of large data groups at summarized levels, and may require the use of special data organization, access, and implementation
methods based on multidimensional views. Processing OLAP queries in operational databases would
substantially degrade the performance of operational tasks.
Moreover, an operational database supports the concurrent processing of multiple transactions.
Concurrency control and recovery mechanisms (e.g., locking and logging) are required to ensure the
consistency and robustness of transactions. An OLAP query often needs read-only access of data
records for summarization and aggregation. Decision support requires historic data, whereas
operational databases do not typically maintain historic data.
In this context, the data in operational databases, though abundant, are usually far from complete for
decision making. Decision support requires consolidation (e.g., aggregation and summarization) of data
from heterogeneous sources, resulting in high-quality, clean, integrated data. In contrast, operational
databases contain only detailed raw data, such as transactions, which need to be consolidated before
analysis.

Data Warehouse Architecture
In this section, we discuss issues regarding data warehouse architecture. Section 3.3.1 gives a general account of how to design and construct a data warehouse. Section 3.3.2 describes a three-tier data warehouse architecture. Section 3.3.3 describes back-end tools and utilities for data warehouses. Section 3.3.4 describes the metadata repository. Section 3.3.5 presents various types of warehouse servers for OLAP processing.

Steps for the Design and Construction of Data Warehouses
This subsection presents a business analysis framework for data warehouse design. The basic steps involved in the design process are also described.
The Design of a Data Warehouse: A Business Analysis Framework
“What can business analysts gain from having a data warehouse?” First, having a data warehouse may
provide a competitive advantage by presenting relevant information from which to measure performance
and make critical adjustments in order to help win over competitors. Second, a data warehouse can
enhance business productivity because it is able to quickly and efficiently gather information that
accurately describes the organization. Third, a data warehouse facilitates customer relationship
management because it provides a consistent view of customers and items across all lines of business, all
departments, and all markets. Finally, a data warehouse may bring about cost reduction by tracking
trends, patterns, and exceptions over long periods in a consistent and reliable manner. To design an
effective data warehouse we need to understand and analyze business needs and construct a business
analysis framework. The construction of a large and complex information system can be viewed as the
construction of a large and complex building, for which the owner, architect, and builder have different
views. These views are combined to form a complex framework that represents the top-down, business-
driven, or owner’s perspective, as well as the bottom-up, builder-driven, or implementor’s view of the
information system. Four different views regarding the design of a data warehouse must be considered:
the top-down view, the data source view, the data warehouse view, and the business query view. The
top-down view allows the selection of the relevant information necessary for the data warehouse. This
information matches the current and future business needs. The data source view exposes the information
being captured, stored, and managed by operational systems. This information may be documented at
various levels of detail and accuracy, from individual data source tables to integrated data source tables.
Data sources are often modeled by traditional data modeling techniques, such as the entity-relationship
model or CASE (computer-aided software engineering) tools. The data warehouse view includes fact
tables and dimension tables. It represents the information that is stored inside the data warehouse,
including precalculated totals and counts, as well as information regarding the source, date, and time of
origin, added to provide historical context. Finally, the business query view is the perspective of data in
the data warehouse from the viewpoint of the end user.
Building and using a data warehouse is a complex task because it requires business skills, technology
skills, and program management skills. Regarding business skills, building a data warehouse involves
understanding how such systems store and manage their data, how to build extractors that transfer data
from the operational system to the data warehouse, and how to build warehouse refresh software that
keeps the data warehouse reasonably up-to-date with the operational system’s data. Using a data
warehouse involves understanding the significance of the data it contains, as well as understanding and
translating the business requirements into queries that can be satisfied by the data warehouse. Regarding
technology skills, data analysts are required to understand how to make assessments from quantitative
information and derive facts based on conclusions from historical information in the data warehouse.
These skills include the ability to discover patterns and trends, to extrapolate trends based on history and
look for anomalies or paradigm shifts, and to present coherent managerial recommendations based on
such analysis. Finally, program management skills involve the need to interface with many technologies,
vendors, and end users in order to deliver results in a timely and cost-effective manner.

The Process of Data Warehouse Design A data warehouse can be built using a top-down approach, a
bottom-up approach, or a combination of both. The top-down approach starts with the overall design and
planning. It is useful in cases where the technology is mature and well known, and where the business
problems that must be solved are clear and well understood. The bottom-up approach starts with
experiments and prototypes. This is useful in the early stage of business modeling and technology
development. It allows an organization to move forward at considerably less expense and to evaluate the
benefits of the technology before making significant commitments. In the combined approach, an
organization can exploit the planned and strategic nature of the top-down approach while retaining the
rapid implementation and opportunistic application of the bottom-up approach. From the software
engineering point of view, the design and construction of a data warehouse may consist of the following
steps: planning, requirements study, problem analysis, warehouse design, data integration and testing,
and finally deployment of the data warehouse. Large software systems can be developed using two
methodologies: the waterfall method or the spiral method. The waterfall method performs a structured
and systematic analysis at each step before proceeding to the next, which is like a waterfall, falling from
one step to the next. The spiral method involves the rapid generation of increasingly functional systems,
with short intervals between successive releases. This is considered a good choice for data warehouse
development, especially for data marts, because the turnaround time is short, modifications can be done
quickly, and new designs and technologies can be adapted in a timely manner.

In general, the warehouse design process consists of the following steps:


1. Choose a business process to model, for example, orders, invoices, shipments, inventory, account
administration, sales, or the general ledger. If the business process is organizational and involves multiple complex object collections, a data warehouse model
should be followed. However, if the process is departmental and focuses on the analysis of one kind of
business process, a data mart model should be chosen.
2. Choose the grain of the business process. The grain is the fundamental, atomic level of data to be
represented in the fact table for this process, for example, individual transactions, individual daily
snapshots, and so on.
3. Choose the dimensions that will apply to each fact table record. Typical dimensions are time, item,
customer, supplier, warehouse, transaction type, and status.
4. Choose the measures that will populate each fact table record. Typical measures are numeric additive
quantities like dollars sold and units sold.
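As a concrete illustration of these four steps, the following minimal Python sketch simply records the design decisions for a hypothetical AllElectronics-style sales process; the grain, dimension, and measure names are placeholders chosen for this example, not a prescribed format.

```python
# Minimal sketch of the four warehouse design steps, captured as a plain
# Python structure. All names below are illustrative placeholders.
design = {
    "business_process": "sales",                          # Step 1: process to model
    "grain": "one row per individual sales transaction",  # Step 2: fact table grain
    "dimensions": ["time", "item", "branch", "location"], # Step 3: dimensions
    "measures": ["dollars_sold", "units_sold"],           # Step 4: numeric, additive facts
}

for step, value in design.items():
    print(f"{step}: {value}")
```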
Because data warehouse construction is a difficult and long-term task, its implementation scope should
be clearly defined. The goals of an initial data warehouse implementation should be specific, achievable,
and measurable. This involves determining the time and budget allocations, the subset of the
organization that is to be modeled, the number of data sources selected, and the number and types of
departments to be served. Once a data warehouse is designed and constructed, the initial deployment of
the warehouse includes initial installation, roll-out planning, training, and orientation. Platform upgrades
and maintenance must also be considered. Data warehouse administration includes data refreshment, data
source synchronization, planning for disaster recovery, managing access control and security, managing
data growth, managing database performance, and data warehouse enhancement and extension. Scope
management includes controlling the number and range of queries, dimensions, and reports; limiting the
size of the data warehouse; or limiting the schedule, budget, or resources. Various kinds of data
warehouse design tools are available. Data warehouse development tools provide functions to define and
edit metadata repository contents (such as schemas, scripts, or rules), answer queries, output reports, and
ship metadata to and from relational database system catalogues. Planning and analysis tools study the
impact of schema changes and of refresh performance when changing refresh rates or time windows.
Data Warehousing: A Multitiered Architecture

1. The bottom tier is a warehouse database server that is almost always a relational database system. Back-end tools and utilities are used to feed data into the bottom tier from operational

databases or other external sources. These tools and utilities perform data extraction, cleaning, and transformation (e.g., to merge similar data from different sources into a unified format), as well as load and refresh functions to update the data warehouse. The data are extracted using application program interfaces known as gateways. A gateway is supported by the underlying DBMS and allows client programs to generate SQL code to be executed at a server. Examples of gateways include ODBC (Open Database Connectivity) and OLE DB (Object Linking and Embedding, Database) by Microsoft, and JDBC (Java Database Connectivity). This tier also contains a metadata repository, which stores
information about the data warehouse and its contents.
2. The middle tier is an OLAP server that is typically implemented using either (1) a relational OLAP (ROLAP) model (i.e., an extended relational DBMS that maps operations on multidimensional data to standard relational operations); or (2) a multidimensional OLAP (MOLAP) model (i.e., a special-purpose server that directly implements multidimensional data and operations).
3. The top tier is a front-end client layer, which contains query and reporting tools, analysis tools,
and/or data mining tools (e.g., trend analysis, prediction, and so on).
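As a rough sketch of the bottom tier described above, the example below uses Python's standard DB-API (with sqlite3 standing in for an ODBC/JDBC-style gateway to an operational source) to extract operational records and load them into a warehouse fact table. All table and column names here are invented for illustration.

```python
import sqlite3

# sqlite3 stands in for a gateway (ODBC/JDBC) to an operational database.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (order_id INTEGER, item TEXT, amount REAL)")
source.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                   [(1, "computer", 1200.0), (2, "phone", 800.0)])

# The warehouse lives in a separate store (bottom tier).
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE sales_fact (order_id INTEGER, item TEXT, dollars_sold REAL)")

# Extract from the operational source and load into the warehouse table.
rows = source.execute("SELECT order_id, item, amount FROM orders").fetchall()
warehouse.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", rows)

print(warehouse.execute(
    "SELECT COUNT(*), SUM(dollars_sold) FROM sales_fact").fetchone())
```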

Data Warehouse Models: Enterprise Warehouse, Data Mart, and Virtual Warehouse

From the architecture point of view, there are three data warehouse models: the enterprise warehouse, the data mart, and the virtual warehouse.
Enterprise warehouse: An enterprise warehouse collects all of the information about subjects spanning the entire organization. It provides corporate-wide data integration, usually from one or more
operational systems or external information providers, and is cross-functional in scope. It typically
contains detailed data as well as summarized data, and can range in size from a few gigabytes to
hundreds of gigabytes, terabytes, or beyond. An enterprise data warehouse may be implemented on
traditional mainframes, computer superservers, or parallel architecture platforms. It requires extensive
business modeling and may take years to design and build.

Data mart: A data mart contains a subset of corporate-wide data that is of value to a specific group of users. The scope is confined to specific selected subjects. For example, a marketing data mart may confine its subjects to customer, item, and sales. The data contained in data marts tend to be summarized.
Data marts are usually implemented on low-cost departmental servers that are Unix/Linux or Windows based. The implementation cycle of a data mart is more likely to be measured in weeks rather than
months or years. However, it may involve complex integration in the long run if its design and
planning were not enterprise-wide.
Data marts are of two types:
1. Independent data marts
2. Dependent data marts
1. Independent data marts are sourced from data captured from one or more operational systems or external information providers, or from data generated locally within a particular department or geographic area.
2. Dependent data marts are sourced directly from enterprise data warehouses.
Virtual warehouse: A virtual warehouse is a set of views over operational databases. For efficient query processing, only some of the possible summary views may be materialized. A virtual warehouse is easy to build but requires excess capacity on operational database servers.
“What are the pros and cons of the top-down and bottom-up approaches to data warehouse
development?”

Top-down development of an enterprise warehouse
Pros:

Minimizes integration problems.
Cons:
1. It is expensive.
2. It takes a long time to develop.
3. It lacks flexibility.
Bottom-up approach
Pros:
Flexibility, low cost, and rapid return on investment.
Cons:
May lead to problems when integrating various disparate data marts into a consistent enterprise data warehouse.

A recommended method for the development of data warehouse systems is to implement the
warehouse in an incremental and evolutionary manner. First, a high-level corporate data model is
defined within a reasonably short period (such as one or two months) that provides a corporate-wide,
consistent, integrated view of data among different subjects and potential usages. Second, independent data marts can be implemented in parallel with the enterprise warehouse based on the same corporate data model set noted before. Third, distributed data marts can be constructed to integrate different data
marts via hub servers. Finally, a multitier data warehouse is constructed where the enterprise
warehouse is the sole custodian of all warehouse data, which is then distributed to the various
dependent data marts.

Extraction, Transformation, and Loading
Data warehouse systems use back-end tools and utilities to populate and refresh their data. These tools and utilities include the following functions:
Data extraction, which typically gathers data from multiple, heterogeneous, and external sources.
Data cleaning, which detects errors in the data and rectifies them when possible.
Data transformation, which converts data from legacy or host format to warehouse format.
Load, which sorts, summarizes, consolidates, computes views, checks integrity, and builds indices and partitions.
Refresh, which propagates the updates from the data sources to the warehouse.
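A toy end-to-end pass over these five functions might look like the pandas sketch below; the column names, the cleaning rule, and the refresh logic are assumptions made for illustration rather than the behavior of any particular ETL tool.

```python
import pandas as pd

# Extraction: gather records from two heterogeneous (here, in-memory) sources.
src_a = pd.DataFrame({"item": ["computer", "phone"], "dollars": [1200.0, None]})
src_b = pd.DataFrame({"item": ["phone", "security"], "dollars": [750.0, 150.0]})
raw = pd.concat([src_a, src_b], ignore_index=True)

# Cleaning: detect and rectify errors (here, fill a missing amount with 0).
clean = raw.fillna({"dollars": 0.0})

# Transformation: convert to the warehouse format (consistent column names).
transformed = clean.rename(columns={"dollars": "dollars_sold"})

# Load: summarize and build a simple summary "view" in the warehouse.
summary = transformed.groupby("item", as_index=False)["dollars_sold"].sum()

# Refresh: propagate a new update from a source into the warehouse summary.
update = pd.DataFrame({"item": ["computer"], "dollars_sold": [300.0]})
refreshed = (pd.concat([summary, update])
             .groupby("item", as_index=False)["dollars_sold"].sum())
print(refreshed)
```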

Metadata Repository

Metadata are data about data. When used in a data warehouse, metadata are the data that define warehouse objects. Metadata are created for the data names and definitions of the given warehouse.

Additional metadata are created and captured for timestamping any extracted data, the source of the extracted data, and missing fields that have been added by data cleaning or integration processes.
A metadata repository should contain the following:
 A description of the data warehouse structure, which includes the warehouse schema, view,
dimensions, hierarchies, and derived data definitions, as well as data mart locations and
contents.
 Operational metadata, which include data lineage (history of migrated data and the sequence of transformations applied to it), currency of data (active, archived, or purged), and monitoring information (warehouse usage statistics, error reports, and audit trails).
 The algorithms used for summarization, which include measure and dimension definition
algorithms, data on granularity, partitions, subject areas, aggregation, summarization, and
predefined queries and reports.
 Mapping from the operational environment to the data warehouse, which includes source
databases and their contents, gateway descriptions, data partitions, data extraction, cleaning,
transformation rules and defaults, data refresh and purging rules, and security (user
authorization and access control).
 Data related to system performance, which include indices and profiles that improve data
access and retrieval performance, in addition to rules for the timing and scheduling of refresh,
update, and replication cycles.
 Business metadata, which include business terms and definitions, data ownership information,
and charging policies.
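A metadata repository can be thought of as a structured catalogue over the warehouse itself. The sketch below loosely illustrates a few of the categories listed above as a plain Python dictionary; every key and value is an invented example, not a standard metadata schema.

```python
# Illustrative metadata repository entries (all values are made-up examples).
metadata_repository = {
    "warehouse_structure": {
        "schema": "star",
        "dimensions": ["time", "item", "branch", "location"],
        "data_marts": {"marketing": ["customer", "item", "sales"]},
    },
    "operational_metadata": {
        "lineage": ["extracted from orders DB", "cleaned", "aggregated by day"],
        "currency": "active",
        "usage_stats": {"queries_last_week": 42},
    },
    "mappings": {
        "source": "operational orders table",
        "refresh_rule": "nightly incremental load",
    },
    "business_metadata": {"owner": "sales department"},
}

print(metadata_repository["operational_metadata"]["currency"])
```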
A data warehouse contains different levels of summarization, of which metadata is one. Other types
include current detailed data (which are almost always on disk), older detailed data (which are usually
on tertiary storage), lightly summarized data, and highly summarized data (which may or may not be
physically housed).

Types of OLAP Servers: ROLAP versus MOLAP versus HOLAP
Logically, OLAP servers present business users with multidimensional data from data warehouses or data marts, without concerns regarding how or where the data are stored. However, the physical architecture and implementation of OLAP servers must consider data storage issues. Implementations of a warehouse server for OLAP processing include the following:

Relational OLAP (ROLAP) servers: These are the intermediate servers that stand in between a
relational back-end server and client front-end tools. They use a relational or extended-relational DBMS
to store and manage warehouse data, and OLAP middleware to support missing pieces. ROLAP servers
include optimization for each DBMS back end, implementation of aggregation navigation logic, and
additional tools and services. ROLAP technology tends to have greater scalability than MOLAP
technology. The DSS server of Microstrategy, for example, adopts the ROLAP approach.

Multidimensional OLAP (MOLAP) servers: These servers support multidimensional views of data through array-based
multidimensional storage engines. They map multidimensional views directly to data cube array
structures. The advantage of using a datacube is that it allows fast indexing to precomputed summarized
data. Notice that with multidimensional data stores, the storage utilization may be low if the data set is
sparse. In such cases, sparse matrix compression techniques should be explored (Chapter 4). Many
MOLAP servers adopt a two-level storage representation to handle dense and sparse data sets: denser
subcubes are identified and stored as array structures, whereas sparse subcubes employ compression
technology for efficient storage utilization.

Hybrid OLAP (HOLAP) servers: The hybrid OLAP approach combines ROLAP and MOLAP
technology, benefiting from the greater scalability of ROLAP and the faster computation of MOLAP.
For example, a HOLAP server may allow large volumes of detail data to be stored in a relational
database, while aggregations are kept in a separate MOLAP store. The Microsoft SQL Server 2000
supports a hybrid OLAP server.
Specialized SQL servers: To meet the growing demand of OLAP processing in relational databases,
some database system vendors implement specialized SQL servers that provide advanced query language
and query processing support for SQL queries over star and snowflake schemas in a read-only
environment.

“How are data actually stored in ROLAP and MOLAP architectures?” Let’s first look at ROLAP. As its
name implies, ROLAP uses relational tables to store data for on-line analytical processing. Recall that
the fact table associated with a base cuboid is referred to as a base fact table. The base fact table stores
data at the abstraction level indicated by the join keys in the schema for the given data cube. Aggregated
data can also be stored in fact tables, referred to as summary fact tables. Some summary fact tables store
both base fact table data and aggregated data, as in Example 3.10. Alternatively, separate summary fact
tables can be used for each level of abstraction, to store only aggregated data.
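As a small pandas illustration of this ROLAP storage idea, the sketch below builds a base fact table at the (time, item, location) level and derives a summary fact table aggregated by item; all rows and values are invented.

```python
import pandas as pd

# Base fact table: data at the abstraction level of the join keys (invented rows).
base_fact = pd.DataFrame({
    "time_key": ["Q1", "Q1", "Q2"],
    "item_key": ["computer", "phone", "computer"],
    "location_key": ["Vancouver", "Toronto", "Vancouver"],
    "dollars_sold": [1200.0, 800.0, 1500.0],
})

# Summary fact table: aggregated data stored at a coarser level (by item only).
summary_fact = base_fact.groupby("item_key", as_index=False)["dollars_sold"].sum()
print(summary_fact)
```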
Data Warehouse Modeling: Data Cube and OLAP

Data warehouses and OLAP tools are based on a multidimensional data model. This model views data in the form of a data cube.

Data Cube: A Multidimensional Data Model

“What is a data cube?” A data cube allows data to be modeled and viewed in multiple dimensions. It is defined by dimensions and facts.
In general terms, dimensions are the perspectives or entities with respect to which an organization
wants to keep records. For example, AllElectronics may create a sales data warehouse in order to keep
records of the store’s sales with respect to the dimensions time, item, branch, and location. Each
dimension may have a table associated with it, called a dimension table, which further describes the
dimension. For example, a dimension table for item may contain the attributes item name, brand, and
type.
A multidimensional data model is typically organized around a central theme, such as sales. This
theme is represented by a fact table. Facts are numeric measures. Examples of facts for a sales data
warehouse include dollars_sold (sales amount in dollars), units_sold (number of units sold), and
amount_budgeted. The fact table contains the names of the facts, or measures, as well as keys to each of
the related dimension tables.
In 2-D representation, the sales for Vancouver are shown with respect to the time dimension (organized in quarters) and the item dimension (organized according to the types of items sold). The fact or measure displayed is dollars_sold (in thousands). Now, suppose that we would like to view the sales data with a third dimension. For instance, suppose we would like to view the data according to

time and item, as well as location, for the cities Chicago, New York, Toronto, and Vancouver. Suppose
that we would now like to view our sales data with an additional fourth dimension such as supplier.
Viewing things in 4-D becomes tricky. However, we can think of a 4-D cube as being a series of 3-D
cubes, as shown below,

In the data warehousing literature, a data cube such as each of these is often referred to as a cuboid. Given a set of dimensions, we can generate a cuboid for each of the possible subsets of the given dimensions. The result would form a lattice of cuboids, each showing the data at a different level of summarization, or group-by. The lattice of cuboids is then referred to as a data cube.

Figure: Lattice of cuboids.
The following figure shows the forming of a data cube for the dimensions time, item, location, and
supplier. The cuboid that holds the lowest level of summarization is called the base cuboid. For
example, the 4-D cuboid in Figure 4.4 is the base cuboid for the given time, item, location, and supplier dimensions. Figure 4.3 is a 3-D (nonbase) cuboid for time, item, and location, summarized for all suppliers. The 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. In our example, this is the total sales, or dollars_sold, summarized over all four dimensions. The apex cuboid is typically denoted by all.
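The lattice can be enumerated mechanically: every subset of the dimension set defines one cuboid, the full set gives the base cuboid, and the empty set gives the apex. The sketch below, using a small invented fact table, generates all 2^4 = 16 cuboids for the dimensions time, item, location, and supplier with pandas group-bys.

```python
from itertools import combinations
import pandas as pd

# Invented base data over the dimensions time, item, location, and supplier.
sales = pd.DataFrame({
    "time": ["Q1", "Q1", "Q2"],
    "item": ["computer", "phone", "computer"],
    "location": ["Vancouver", "Toronto", "Vancouver"],
    "supplier": ["SUP1", "SUP2", "SUP1"],
    "dollars_sold": [1200.0, 800.0, 1500.0],
})

dimensions = ["time", "item", "location", "supplier"]

# One cuboid (group-by) per subset of dimensions: 2**4 = 16 cuboids in total.
# The 4-D cuboid is the base cuboid; the 0-D cuboid (no dimensions) is the apex.
for k in range(len(dimensions) + 1):
    for dims in combinations(dimensions, k):
        if dims:
            cuboid = sales.groupby(list(dims), as_index=False)["dollars_sold"].sum()
            rows = len(cuboid)
        else:
            cuboid = sales["dollars_sold"].sum()  # apex: total over all dimensions
            rows = 1
        print(f"{k}-D cuboid {dims or 'all'}: {rows} row(s)")
```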

Stars, Snowflakes, and Fact Constellations: Schemas for Multidimensional Data Models
A data warehouse, however, requires a concise, subject-oriented schema that facilitates online data analysis. The most popular data model for a data warehouse is a multidimensional model, which can exist in the form of a star schema, a snowflake schema, or a fact constellation schema.

Star schema: The most common modeling paradigm is the star schema, in which the data warehouse contains (1) a large central table (fact table) containing the bulk of the data, with no redundancy, and (2) a set of smaller attendant tables (dimension tables), one for each dimension.
Example 4.1 Star schema. A star schema for AllElectronics sales is shown in Figure 4.6. Sales are considered

along four dimensions: time, item, branch, and location. The schema contains a central fact table for sales that contains keys to each of the four dimensions, along with two measures: dollars_sold and units_sold. To minimize the size of the fact table, dimension identifiers (e.g., time_key and item_key) are system-generated identifiers.

Notice that in the star schema, each dimension is represented by only one table, and each table contains a set of attributes. For example, the location dimension table contains the attribute set {location_key, street, city, province_or_state, country}. This constraint may introduce some redundancy. For example, “Urbana” and “Chicago” are both cities in the state of Illinois, USA. Entries for such cities in the location dimension table will create redundancy among the attributes province_or_state and country; that is, (..., Urbana, IL, USA) and (..., Chicago, IL, USA). Moreover, the attributes within a dimension table may form either a hierarchy (total order) or a lattice (partial order).
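The sketch below encodes a trimmed-down version of this star schema in SQL through Python's sqlite3 module, with a central sales fact table, one table per dimension, and a typical star join; the column lists are abbreviated and all keys and rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension tables: one (denormalized) table per dimension.
cur.execute("CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER)")
cur.execute("CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT)")
cur.execute("CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, city TEXT, province_or_state TEXT, country TEXT)")

# Central fact table: keys to each dimension plus the measures.
cur.execute("""CREATE TABLE sales_fact (
                 time_key INTEGER, item_key INTEGER, location_key INTEGER,
                 dollars_sold REAL, units_sold INTEGER)""")

cur.execute("INSERT INTO time_dim VALUES (1, 'Q1', 2024)")
cur.execute("INSERT INTO item_dim VALUES (10, 'laptop', 'BrandA', 'computer')")
cur.execute("INSERT INTO location_dim VALUES (100, 'Chicago', 'IL', 'USA')")
cur.execute("INSERT INTO sales_fact VALUES (1, 10, 100, 1200.0, 1)")

# A typical star join: total dollars_sold per city.
for row in cur.execute("""SELECT l.city, SUM(f.dollars_sold)
                          FROM sales_fact f JOIN location_dim l
                            ON f.location_key = l.location_key
                          GROUP BY l.city"""):
    print(row)
```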

Snowflake schema: The snowflake schema is a variant of the star schema model, where some dimension tables are normalized, thereby further splitting the data into additional tables.
The major differences between the snowflake and star schema models are:
1. Dimension tables of the snowflake model may be kept in normalized form to reduce redundancies.
2. The normalized tables are easier to maintain.
3. Storage space is saved.
Disadvantages
1. The snowflake structure can reduce the effectiveness of browsing.
2. More joins will be needed to execute a query.
3. System performance may be adversely impacted.
Hence, although the snowflake schema reduces redundancy, it is not as popular as the star schema in data warehouse design.

The main difference between the two schemas is in the definition of dimension tables. The single
dimension table for item in the star schema is normalized in the snowflake schema, resulting in new
item and supplier tables. For example, the item dimension table now contains the attributes item_key,
item_name, brand, type, and supplier_key, where supplier_key is linked to the supplier dimension
table, containing supplier_key and
supplier_type information. Similarly, the single dimension table for location in the star schema can be normalized into two new tables: location and city. The city_key in the new location table links to the city dimension.
Fact constellation: Sophisticated applications may require multiple fact tables to share dimension
tables. This kind of schema can be viewed as a collection of stars, and hence is called a galaxy schema
or a fact constellation.
Example 4.3 Fact constellation. A fact constellation schema is shown in Figure 4.8. This schema
specifies two fact tables, sales and shipping. The sales table definition is identical to that of the star
schema (Figure 4.6). The shipping table has five dimensions, or keys—item_key, time_key,
shipper_key, from_location, and to_location—and two measures—dollars_cost and units_shipped. A
fact constellation schema allows dimension tables to be shared between fact tables. For example, the dimension tables for time, item, and location are shared between the sales and shipping fact tables.

In data warehousing, there is a distinction between a data warehouse and a data mart. A data warehouse collects information about subjects that span the entire organization, such as customers,

items, sales, assets, and personnel, and thus its scope is enterprise-wide. For data warehouses, the fact constellation schema is commonly used, since it can model multiple, interrelated subjects. A data mart, on the other hand, is a department subset of the data warehouse that focuses on selected subjects, and thus its scope is department-wide. For data marts, the star or snowflake schema is commonly used.

Dimensions: The Role of Concept Hierarchies
A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts. Consider a concept hierarchy for the dimension location. City values for location include Vancouver, Toronto, New York, and Chicago. Each city, however, can be mapped to the province or state to which it belongs. For example, Vancouver can be mapped to British Columbia, and Chicago to Illinois.
The provinces and states can in turn be mapped to the country (e.g., Canada or the United States) to which they belong. These mappings form a concept hierarchy for the dimension location, mapping a set of low-level concepts (i.e., cities) to higher-level, more general concepts (i.e., countries).
Many concept hierarchies are implicit within the database schema. For example, suppose that the dimension location is described by the attributes number, street, city, province_or_state, zip_code, and country. These attributes are related by a total order, forming a concept hierarchy such as “street < city < province_or_state < country.” This hierarchy is shown in Figure 4.10(a). Alternatively, the attributes of a dimension may

be organized in a partial order, forming a lattice. An example of a partial order for the time dimension based on the attributes day, week, month, quarter, and year is “day < {month < quarter; week} < year.” This lattice structure is shown in Figure 4.10(b). A concept hierarchy that is a total or partial order among attributes in a database schema is called a schema hierarchy. Concept hierarchies that are common to many applications (e.g., for time) may be predefined in the data mining system.
Concept hierarchies may also be defined by discretizing or grouping values for a given dimension or
attribute, resulting in a set-grouping hierarchy. A total or partial order can be defined among groups
of values. An example of a set-grouping hierarchy is shown in Figure 4.11 for the dimension price,
where an interval ($X ... $Y] denotes the range from $X (exclusive) to $Y (inclusive).
There may be more than one concept hierarchy for a given attribute or dimension, based on different
user viewpoints. For instance, a user may prefer to organize price by defining ranges for inexpensive,
moderately priced, and expensive.
Concept hierarchies may be provided manually by system users, domain experts, or knowledge engineers, or may be automatically generated based on statistical analysis of the data distribution.
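Both kinds of hierarchy are easy to represent in code. The sketch below shows a schema-style hierarchy as explicit city-to-state and state-to-country mappings, and a set-grouping hierarchy over price built with pandas cut; the bin edges and labels are invented for this example.

```python
import pandas as pd

# Schema-style hierarchy: explicit mappings from low-level to higher-level concepts.
city_to_state = {"Vancouver": "British Columbia", "Chicago": "Illinois", "Urbana": "Illinois"}
state_to_country = {"British Columbia": "Canada", "Illinois": "USA"}

def generalize(city):
    """Map a city up the location hierarchy: city -> province_or_state -> country."""
    state = city_to_state[city]
    return city, state, state_to_country[state]

print(generalize("Chicago"))        # ('Chicago', 'Illinois', 'USA')

# Set-grouping hierarchy for price: ($X ... $Y] intervals (invented ranges/labels).
prices = pd.Series([45, 180, 620, 1500])
groups = pd.cut(prices, bins=[0, 100, 400, 800, 2000],
                labels=["inexpensive", "moderately priced", "expensive", "premium"])
print(groups.tolist())
```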
Measures: Their Categorization and Computation
“How are measures computed?”
A data cube measure is a numeric function that can be evaluated at each point in the data cube space. A measure value is computed for a given point by aggregating the data corresponding to the respective dimension–value pairs defining the given point. For example, <time = “Q1”, location = “Vancouver”, item = “computer”> is a set of dimension–value pairs defining one such point.

Measures can be organized into three categories—distributive, algebraic, and holistic— based on the
kind of aggregate functions used.

A measure is distributive if it is obtained by applying a distributive aggregate function. Distributive measures can be computed efficiently because of the way the computation can be partitioned. A function can be computed in a distributed manner as follows. Suppose the data are partitioned into n sets. We apply the function to each partition, resulting in n aggregate values. If the result derived by applying the function to the n aggregate values is the same as that derived by applying the function to the entire data set (without partitioning), the function can be computed in a distributed manner. For example, sum() can be computed for a data cube by first partitioning the cube into a set of subcubes, computing sum() for each subcube, and then summing up the sums obtained for each subcube. Hence, sum() is a distributive aggregate function. For the same reason, count(), min(), and max() are distributive aggregate functions.

A measure is algebraic if it is obtained by applying an algebraic aggregate function. It can be computed by an algebraic function with M arguments (where M is a bounded positive integer), each of which is obtained by applying a distributive aggregate function. For example, avg() (average) can be
computed by sum()/count(), where both sum() and count() are distributive aggregate functions.
Similarly, it can be shown that min_N() and max_N() (which find the N minimum and N maximum values, respectively, in a given set) and standard_deviation() are algebraic aggregate functions.

A measure is holistic if it is obtained by applying a holistic aggregate function.
An aggregate function is holistic if there is no constant bound on the storage size needed to describe a subaggregate. That is, there does not exist an algebraic function with M arguments (where M is a constant) that characterizes the computation. Common examples of holistic functions include median(), mode(), and rank().

Most large data cube applications require efficient computation of distributive and algebraic measures. Many efficient techniques for this exist. In contrast, it is difficult to compute holistic measures efficiently. Efficient techniques to approximate the computation of some holistic measures, however, do exist.
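The three categories can be illustrated by computing measures over partitions of a small invented data set: sum() distributes over partitions, avg() is algebraic because it can be combined from the distributive sum() and count(), and median() is holistic because it generally cannot be recovered from per-partition medians.

```python
import statistics

data = [1, 2, 3, 100, 200, 300]
partitions = [data[:3], data[3:]]   # pretend the data are partitioned into n = 2 sets

# Distributive: the sum over the full set equals the sum of per-partition sums.
partial_sums = [sum(p) for p in partitions]
assert sum(data) == sum(partial_sums)

# Algebraic: avg() is computed from M = 2 distributive arguments, sum() and count().
partial_counts = [len(p) for p in partitions]
avg = sum(partial_sums) / sum(partial_counts)
assert avg == sum(data) / len(data)

# Holistic: median() generally cannot be derived from per-partition medians.
per_partition_medians = [statistics.median(p) for p in partitions]
print(statistics.median(data))                   # 51.5
print(statistics.median(per_partition_medians))  # 101.0 -- not the true median
```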

Typical OLAP Operations
“How are concept hierarchies useful in OLAP?” In the multidimensional model, data are organized
into multiple dimensions, and each dimension contains multiple levels of abstraction defined by
concept hierarchies. This organization provides users with the flexibility to view data from different
perspectives.

Roll-up: The roll-up operation (also called the drill-up operation by some vendors) performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction.
This hierarchy was defined as the total order “street < city < province_or_state < country.” The roll-up operation shown aggregates the data by ascending the location hierarchy from the level of city to the level of country.
When roll-up is performed by dimension reduction, one or more dimensions are removed from the given cube. For example, consider a sales data cube containing only the location and time dimensions. Roll-up may be performed by removing, say, the time dimension, resulting in an aggregation of the total sales by location, rather than by location and by time.

Drill-down: Drill-down is the reverse of roll-up. It navigates from less detailed data to more detailed data. Drill-down can be realized by either stepping down a concept hierarchy for a dimension or introducing additional dimensions. Consider a concept hierarchy for time defined as “day < month < quarter < year.” Drill-down occurs by descending the time hierarchy from the level of quarter to the more detailed level of month. Because a drill-down adds more detail to the given data, it can also be performed by adding new dimensions to a cube.

Slice and dice: The slice operation performs a selection on one dimension of the given cube, resulting
in a subcube. Figure 4.12 shows a slice operation where the sales data are selected from the central
cube for the dimension time using the criterion time = “Q1.” The dice operation defines a subcube by
performing a selection on two or more dimensions. Figure 4.12 shows a dice operation on the central
cube based on the following selection criteria that involve three dimensions: (location = “Toronto” or
“Vancouver”) and (time = “Q1” or “Q2”) and (item = “home entertainment” or “computer”).

Pivot (rotate): Pivot (also called rotate) is a visualization operation that rotates the data axes in view
to provide an alternative data presentation. Figure 4.12 shows a pivot operation where the item and
location axes in a 2-D slice are rotated.
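The sketch below imitates these operations on a tiny invented sales DataFrame with pandas: roll-up and drill-down as group-bys at coarser and finer hierarchy levels, slice and dice as selections, and pivot as a rotation of the presentation axes. It is only an analogy to what an OLAP server does internally, not an OLAP engine.

```python
import pandas as pd

sales = pd.DataFrame({
    "quarter": ["Q1", "Q1", "Q2", "Q2"],
    "month":   ["Jan", "Feb", "Apr", "May"],
    "city":    ["Toronto", "Vancouver", "Toronto", "Vancouver"],
    "country": ["Canada", "Canada", "Canada", "Canada"],
    "item":    ["computer", "phone", "computer", "phone"],
    "dollars_sold": [1000.0, 400.0, 1200.0, 500.0],
})

# Roll-up: climb the location hierarchy from city to country.
rollup = sales.groupby(["quarter", "country"])["dollars_sold"].sum()

# Drill-down: descend the time hierarchy from quarter to month.
drilldown = sales.groupby(["month", "city"])["dollars_sold"].sum()

# Slice: select on one dimension (time = "Q1").
slice_q1 = sales[sales["quarter"] == "Q1"]

# Dice: select on two or more dimensions.
dice = sales[sales["city"].isin(["Toronto", "Vancouver"])
             & sales["quarter"].isin(["Q1", "Q2"])
             & sales["item"].isin(["computer", "phone"])]

# Pivot (rotate): swap the item and city axes in a 2-D presentation.
pivot = slice_q1.pivot_table(index="item", columns="city",
                             values="dollars_sold", aggfunc="sum")
print(pivot)
```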

Other OLAP operations: Some OLAP systems offer additional drilling operations. For example,
drill-across executes queries involving (i.e., across) more than one fact table. The drill-through
operation uses relational SQL facilities to drill through the bottom level of a data cube down to its
back-end relational tables.
Other OLAP operations may include ranking the top N or bottom N items in lists, as well as computing
moving averages, growth rates, interests, internal return rates, depreciation, currency conversions, and
statistical functions.

OLAP offers analytical modeling capabilities, including a calculation engine for deriving ratios,
variance, and so on, and for computing measures across multiple dimensions. OLAP also supports
functional models for
forecasting, trend analysis, and statistical analysis.

OLAP Systems versus Statistical Databases
A statistical database (SDB) is a database system that is designed to support statistical applications.
OLAP and SDB systems, however, have distinguishing differences. While SDBs tend to focus on
socioeconomic applications, OLAP has been targeted for business applications. Privacy issues
regarding concept hierarchies are a major concern for SDBs. For example, given summarized
socioeconomic data, it is controversial to allow users to view the corresponding low-level data. Finally, unlike SDBs, OLAP systems are designed for efficiently handling huge amounts of data.

A Starnet Query Model for Querying Multidimensional Databases

The querying of multidimensional databases can be based on a starnet model, which consists of radial lines emanating from a central point, where each line represents a concept hierarchy for a dimension. Each abstraction level in the hierarchy is called a footprint. These represent the granularities available for use by OLAP operations such as drill-down and roll-up.
Example 4.5 Starnet. This starnet consists of four radial lines, representing concept hierarchies for the dimensions location, customer, item, and time, respectively. Each line consists of footprints representing abstraction levels of the dimension. For example, the time line has four footprints: “day,” “month,” “quarter,” and “year.”

Concept hierarchies can be used to generalize data by replacing low-level values (such as “day” for the time dimension) by higher-level abstractions (such as “year”), or to specialize data by replacing higher-level abstractions with lower-level values.

Why Preprocess the Data?


Imagine that you are a manager at AllElectronics and have been charged with analyzing the
company’s data with respect to the sales at your branch. You immediately set out to perform this task.
You carefully inspect the company’s database and data warehouse, identifying and selecting the
attributes or dimensions to be included in your analysis, such as item, price, and units sold. Alas! You
notice that several of the attributes for various tuples have no recorded value. For your analysis, you
would like to include information as to whether each item purchased was advertised as on sale, yet you
discover that this information has not been recorded. Furthermore, users of your database system have
reported errors, unusual values, and inconsistencies in the data recorded for some transactions. In other
words, the data you wish to analyze by data mining techniques are incomplete (lacking attribute values
or certain attributes of interest, or containing only aggregate data), noisy (containing errors, or outlier
values that deviate from the expected), and inconsistent (e.g., containing discrepancies in the
department codes used to categorize items). Welcome to the real world! Incomplete, noisy, and
inconsistent data are commonplace properties of large real-world databases and data warehouses.
Incomplete data can occur for a number of reasons. Attributes of interest may not always be available,
such as customer information for sales transaction data. Other data may not be included simply
because it was not considered important at the time of entry. Relevant data may not be recorded due to
a misunderstanding, or because of equipment malfunctions. Data that were inconsistent with other
recorded data may have been deleted. Furthermore, the recording of the history or modifications to the
data may have been overlooked. Missing data, particularly for tuples with missing values for some
attributes, may need to be inferred.
There are many possible reasons for noisy data (having incorrect
attribute values). The data collection instruments used may be faulty. There may have been human or
computer errors occurring at data entry. Errors in data transmission can also occur. There may be
technology limitations, such as limited buffer size for coordinating synchronized data transfer and
consumption. Incorrect data may also result from inconsistencies in naming conventions or data codes
used, or inconsistent formats for input fields, such as date. Duplicate tuples also require data cleaning.
Data cleaning routines work to “clean” the data by filling in missing values, smoothing noisy data,
identifying or removing outliers, and resolving inconsistencies. If users believe the data are dirty, they
are unlikely to trust the results of any data mining that has been applied to it. Furthermore, dirty data
can cause confusion for the mining procedure, resulting in unreliable output. Although most mining
routines have some procedures for dealing with incomplete or noisy data, they are not always robust.
Instead, they may concentrate on avoiding overfitting the data to the function being modeled.
Therefore, a useful preprocessing step is to run your data through some data cleaning routines. Section
2.3 discusses methods for cleaning up your data.
Getting back to your task at AllElectronics, suppose
that you would like to include data from multiple sources in your analysis. This would involve
integrating multiple
databases, data cubes, or files, that is, data integration. Yet some attributes representing a given
concept may have different names in different databases, causing inconsistencies and redundancies.
For example, the attribute for customer identification may be referred to as customer id in one data
store and cust id in another. Naming inconsistencies may also occur for attribute values. For example,
the same first name could be registered as “Bill” in one database, but “William” in another, and “B.”
in the third. Furthermore, you suspect that some attributes may be inferred from others (e.g., annual
revenue). Having a large amount of redundant data may slow down or confuse the knowledge
discovery process. Clearly, in addition to data cleaning, steps must be taken to help avoid redundancies
during data integration. Typically, data cleaning and data integration are performed as a preprocessing
step when preparing the data for a data warehouse. Additional data cleaning can be performed to
detect and remove redundancies that may have resulted from data integration.
Getting back to your data, you have decided, say, that you would like to use a distance-based mining algorithm for your analysis, such as neural networks, nearest-neighbor classifiers, or clustering. Such methods provide
better results if the data to be analyzed have been normalized, that is, scaled to a specific range such as
[0.0, 1.0]. Your customer data, for example, contain the attributes age and annual salary. The annual
salary attribute usually takes much larger values than age. Therefore, if the attributes are left
unnormalized, the distance measurements taken on annual salary will generally outweigh distance
measurements taken on age. Furthermore, it would be useful for your analysis to obtain aggregate
information as to the sales per customer region—something that is not part of any precomputed data
cube in your data warehouse. You soon realize that data transformation operations, such as
normalization and aggregation, are additional data preprocessing procedures that would contribute
toward the success of the mining process. Data integration and data transformation are discussed in
Section 2.4.
“Hmmm,” you wonder, as you consider your data even further. “The data set I have
selected for analysis is HUGE, which is sure to slow down the mining process. Is there any way I can
reduce the size of my data set, without jeopardizing the data mining results?” Data reduction obtains a
reduced representation of the data set that is much smaller in volume, yet produces the same (or almost
the same) analytical results. There are a number of strategies for data reduction. These include data
aggregation (e.g., building a data cube), attribute subset selection (e.g., removing irrelevant attributes
through correlation analysis), dimensionality reduction (e.g., using encoding schemes such as
minimum length encoding or wavelets), and numerosity reduction (e.g., “replacing” the data by
alternative, smaller representations such as clusters or parametric models). Data reduction is the topic
of Section 2.5.
Data can also be “reduced” by generalization with the use of concept hierarchies, where
low-level concepts, such as city for customer location, are replaced with higher-level concepts, such as
region or province or state. A concept hierarchy organizes the concepts into varying levels of
abstraction. Data discretization is a form of data reduction that is very useful for the automatic
generation of concept hierarchies from numerical data. This is described in Section 2.6, along with the
automatic generation of concept hierarchies for categorical data. Figure 2.1 summarizes the data
preprocessing steps described here. Note that the above categorization is not mutually exclusive. For
example, the removal of redundant data may be seen as a form of data cleaning, as well as data
reduction.
In summary, real-world data tend to be dirty, incomplete, and inconsistent. Data
preprocessing techniques can improve the quality of the data, thereby helping to improve the accuracy
and efficiency of the subsequent mining process. Data preprocessing is an important step in the
knowledge discovery process, because quality decisions must be based on quality data. Detecting data
anomalies, rectifying them early, and reducing the data to be analyzed can lead to huge payoffs for
decision making.
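The sketch below illustrates, on an invented customer table, two of the preprocessing steps discussed here: cleaning by filling in a missing value, and min-max normalization of age and annual salary to [0.0, 1.0] so that neither attribute dominates distance-based analysis.

```python
import pandas as pd

customers = pd.DataFrame({
    "age":           [23, 35, None, 52],
    "annual_salary": [30000.0, 58000.0, 72000.0, 120000.0],
})

# Cleaning: fill the missing age with the mean of the recorded ages.
customers["age"] = customers["age"].fillna(customers["age"].mean())

# Transformation: min-max normalization to [0.0, 1.0], so distance measures
# on annual_salary no longer outweigh those on age.
normalized = (customers - customers.min()) / (customers.max() - customers.min())
print(normalized)
```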

Reference: Data Mining: Concepts and Techniques, 3rd Edition, Jiawei Han, Micheline Kamber & Jian Pei, Elsevier.
