DATA Science Unit - II Part 1
Data Warehousing and Online Analytical Processing: Basic Concepts
What Is a Data Warehouse?
Loosely speaking, a data warehouse refers to a data repository that is maintained separately from an
organization’s operational databases.
According to William H. Inmon, a leading architect in the construction of data warehouse systems, “A
data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in
support of management’s decision making process”.
Subject-oriented: A data warehouse is organized around major subjects such as customer, supplier,
product, and sales. Rather than concentrating on the day-to-day operations and transaction processing
of an organization, a data warehouse focuses on the modeling and analysis of data for decision makers.
Integrated: A data warehouse is usually constructed by integrating multiple heterogeneous sources,
such as relational databases, flat files, and online transaction records. Data cleaning and data
integration techniques are applied to ensure consistency in naming conventions, encoding structures,
attribute measures, and so on.
Time-variant: Data are stored to provide information from a historic perspective (e.g., the past 5–10
years). Every key structure in the data warehouse contains, either implicitly or explicitly, a time
element.
Nonvolatile: A data warehouse is always a physically separate store of data transformed from the
application data found in the operational environment. Due to this separation, a data warehouse does
not require transaction processing, recovery, and concurrency control mechanisms. It usually requires
only two operations in data accessing: initial loading of data and access of data.
Data warehousing is the process of constructing and using data warehouses. The construction of a
data warehouse requires data cleaning, data integration, and data consolidation.
“How are organizations using the information from data warehouses?” Many organizations use this
information to support business decision-making activities, including (1) increasing customer focus,
which includes the analysis of customer buying patterns (such as buying preference, buying time,
budget cycles, and appetites for
spending); (2) repositioning products and managing product portfolios by comparing the performance
of sales by quarter, by year, and by geographic regions in order to fine-tune production strategies; (3)
analyzing operations and looking for sources of profit; and (4) managing customer relationships,
making environmental corrections, and managing the cost of corporate assets.
The traditional database approach to heterogeneous database integration is to build wrappers and
integrators (or mediators) on top of multiple, heterogeneous databases. When a query is posed to a
client site, a metadata dictionary is used to translate the query into queries appropriate for the
individual heterogeneous sites involved. These queries are then mapped and sent to local query
processors. The results returned from the different sites are integrated into a global answer set. This
query-driven approach requires complex information filtering and integration processes, and
competes with local sites for processing resources. It is inefficient and potentially expensive for
frequent queries, especially queries requiring aggregations.
Data warehousing provides an interesting alternative to this traditional approach. Rather than using a
query-driven approach, data warehousing employs an update-driven approach in which information
from multiple, heterogeneous sources is integrated in advance and stored in a warehouse for direct
querying and analysis.
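To make the contrast concrete, the sketch below mimics the query-driven approach with two hypothetical local sites whose schemas and encodings differ; a mediator queries the local processors and integrates the partial results into a global answer at query time. All names and data are invented for illustration; in the update-driven approach, the same integration work would be done once, in advance, when the warehouse is loaded.

```python
# Minimal sketch of query-driven mediation over heterogeneous sources.
# The site names, schemas, and translation rules are hypothetical.

# Two "local sites" with different naming conventions and encodings.
SITE_A = [{"cust": "C1", "amt_usd": 120.0}, {"cust": "C2", "amt_usd": 80.0}]
SITE_B = [{"customer_id": "C1", "amount_cents": 5000}]

def query_site_a(customer):
    # Local query processor for site A.
    return [r["amt_usd"] for r in SITE_A if r["cust"] == customer]

def query_site_b(customer):
    # Local query processor for site B; converts cents to dollars.
    return [r["amount_cents"] / 100 for r in SITE_B if r["customer_id"] == customer]

def mediator_total_spend(customer):
    """Query-driven integration: translate, fan out to local sites, then merge."""
    partial_results = query_site_a(customer) + query_site_b(customer)
    return sum(partial_results)  # integration happens at query time

print(mediator_total_spend("C1"))  # 170.0
```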
Differences between Operational Database Systems and Data Warehouses
The major task of online operational database systems is to perform online transaction and query processing. These systems are called online transaction processing (OLTP) systems. They cover most of the day-to-day operations of an organization, such as purchasing, inventory, manufacturing, banking, payroll, registration, and accounting. Data warehouse systems, on the other hand, serve users or knowledge workers in the role of data analysis and decision making; such systems are known as online analytical processing (OLAP) systems. The major distinguishing features of OLTP and OLAP are as follows:
Users and system orientation: An OLTP system is customer-oriented and is used for transaction and
query processing by clerks, clients, and information technology professionals. An OLAP system is
market-oriented and is used for data analysis by knowledge workers, including managers, executives,
and analysts.
Database design: An OLTP system usually adopts an entity-relationship (ER) data model and an
application-oriented database design. An OLAP system typically adopts either a star or a snowflake
model and a subject-oriented database design.
View: An OLTP system focuses mainly on the current data within an enterprise or department, without referring to historic data or data in different organizations. In contrast, an OLAP system often spans multiple versions of a database schema, due to the evolutionary process of an organization.
Access patterns: The access patterns of an OLTP system consist mainly of short, atomic transactions. Such a system requires concurrency control and recovery mechanisms. However, accesses to OLAP systems are mostly read-only operations (because most data warehouses store historic rather than up-to-date information), although many could be complex queries.
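The difference in access patterns can be illustrated with a hypothetical sales table in an in-memory SQLite database (the table and column names below are assumptions, not from the text): the OLTP-style statements are short, record-level reads and writes, while the OLAP-style query is a read-only aggregation over historic data.

```python
import sqlite3

# Hypothetical sales table used only to contrast OLTP and OLAP access patterns.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (order_id INT, customer TEXT, region TEXT, year INT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?)", [
    (1, "C1", "East", 2023, 120.0),
    (2, "C2", "West", 2023, 80.0),
    (3, "C1", "East", 2024, 200.0),
])

# OLTP-style access: a short, atomic, record-level transaction.
con.execute("UPDATE sales SET amount = 125.0 WHERE order_id = 1")
print(con.execute("SELECT amount FROM sales WHERE order_id = 1").fetchone())

# OLAP-style access: a read-only aggregation over historic data.
for row in con.execute(
        "SELECT region, year, SUM(amount) FROM sales GROUP BY region, year"):
    print(row)
```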
Data Warehouse Architecture
In this section, we discuss issues regarding data warehouse architecture. Section 3.3.1 gives a general account of how to design and construct a data warehouse. Section 3.3.2 describes a three-tier data warehouse architecture. Section 3.3.3 describes back-end tools and utilities for data warehouses. Section 3.3.4 describes the metadata repository. Section 3.3.5 presents various types of warehouse servers for OLAP processing.
Steps for the Design and Construction of Data Warehouses
This subsection presents a business analysis framework for data warehouse design. The basic steps involved in the design process are also described.
The Design of a Data Warehouse: A Business Analysis Framework
“What can business analysts gain from having a data warehouse?” First, having a data warehouse may
provide a competitive advantage by presenting relevant information from which to measure performance
and make critical adjustments in order to help win over competitors. Second, a data warehouse can
enhance business productivity because it is able to quickly and efficiently gather information that
accurately describes the organization. Third, a data warehouse facilitates customer relationship
management because it provides a consistent view of customers and items across all lines of business, all
departments, and all markets. Finally, a data warehouse may bring about cost reduction by tracking
trends, patterns, and exceptions over long periods in a consistent and reliable manner. To design an
effective data warehouse we need to understand and analyze business needs and construct a business
analysis framework. The construction of a large and complex information system can be viewed as the
construction of a large and complex building, for which the owner, architect, and builder have different
views. These views are combined to form a complex framework that represents the top-down, business-
driven, or owner’s perspective, as well as the bottom-up, builder-driven, or implementor’s view of the
information system. Four different views regarding the design of a data warehouse must be considered:
the top-down view, the data source view, the data warehouse view, and the business query view. The
top-down view allows the selection of the relevant information necessary for the data warehouse. This
information matches the current and future business needs. The data source view exposes the information
being captured, stored, and managed by operational systems. This information may be documented at
various levels of detail and accuracy, from individual data source tables to integrated data source tables.
Data sources are often modeled by traditional data modeling techniques, such as the entity-relationship
model or CASE (computer-aided software engineering) tools. The data warehouse view includes fact
tables and dimension tables. It represents the information that is stored inside the data warehouse,
including precalculated totals and counts, as well as information regarding the source, date, and time of
origin, added to provide historical context. Finally, the business query view is the perspective of data in
the data warehouse from the viewpoint of the end user.
Building and using a data warehouse is a complex task because it requires business skills, technology
skills, and program management skills. Regarding business skills, building a data warehouse involves
understanding how such systems store and manage their data, how to build extractors that transfer data
from the operational system to the data warehouse, and how to build warehouse refresh software that
keeps the data warehouse reasonably up-to-date with the operational system’s data. Using a data
warehouse involves understanding the significance of the data it contains, as well as understanding and
translating the business requirements into queries that can be satisfied by the data warehouse. Regarding
technology skills, data analysts are required to understand how to make assessments from quantitative
information and derive facts based on conclusions from historical information in the data warehouse.
These skills include the ability to discover patterns and trends, to extrapolate trends based on history and
look for anomalies or paradigm shifts, and to present coherent managerial recommendations based on
such analysis. Finally, program management skills involve the need to interface with many technologies,
vendors, and end users in order to deliver results in a timely and cost-effective manner.
The Process of Data Warehouse Design
A data warehouse can be built using a top-down approach, a
bottom-up approach, or a combination of both. The top-down approach starts with the overall design and
planning. It is useful in cases where the technology is mature and well known, and where the business
problems that must be solved are clear and well understood. The bottom-up approach starts with
experiments and prototypes. This is useful in the early stage of business modeling and technology
development. It allows an organization to move forward at considerably less expense and to evaluate the
benefits of the technology before making significant commitments. In the combined approach, an
organization can exploit the planned and strategic nature of the top-down approach while retaining the
rapid implementation and opportunistic application of the bottom-up approach. From the software
engineering point of view, the design and construction of a data warehouse may consist of the following
steps: planning, requirements study, problem analysis, warehouse design, data integration and testing,
and finally deployment of the data warehouse. Large software systems can be developed using two
methodologies: the waterfall method or the spiral method. The waterfall method performs a structured
and systematic analysis at each step before proceeding to the next, which is like a waterfall, falling from
one step to the next. The spiral method involves the rapid generation of increasingly functional systems,
with short intervals between successive releases. This is considered a good choice for data warehouse
development, especially for data marts, because the turnaround time is short, modifications can be done
quickly, and new designs and technologies can be adapted in a timely manner.
In general, the warehouse design process consists of the following steps:
1. Choose a business process to model (e.g., orders, invoices, shipments, inventory, account administration, sales, or the general ledger). If the process is organizational and involves multiple complex object collections, a data warehouse model should be followed. However, if the process is departmental and focuses on the analysis of one kind of business process, a data mart model should be chosen.
2. Choose the grain of the business process. The grain is the fundamental, atomic level of data to be
represented in the fact table for this process, for example, individual transactions, individual daily
snapshots, and so on.
3. Choose the dimensions that will apply to each fact table record. Typical dimensions are time, item,
customer, supplier, warehouse, transaction type, and status.
4. Choose the measures that will populate each fact table record. Typical measures are numeric additive
quantities like dollars sold and units sold.
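As a rough illustration of these four steps, the design decisions could be recorded in a small specification object; the process name, grain wording, and the dimension and measure lists below are illustrative choices, not prescribed by the text.

```python
from dataclasses import dataclass, field

@dataclass
class FactTableDesign:
    """Records the four design decisions for one business process (illustrative only)."""
    business_process: str                              # step 1: process to model
    grain: str                                         # step 2: atomic level of a fact row
    dimensions: list = field(default_factory=list)     # step 3: dimensions per fact record
    measures: list = field(default_factory=list)       # step 4: numeric, additive quantities

sales_design = FactTableDesign(
    business_process="retail sales",
    grain="one row per individual sales transaction",
    dimensions=["time", "item", "customer", "supplier", "warehouse",
                "transaction_type", "status"],
    measures=["dollars_sold", "units_sold"],
)
print(sales_design)
```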
Because data warehouse construction is a difficult and long-term task, its implementation scope should
be clearly defined. The goals of an initial data warehouse implementation should be specific, achievable,
and measurable. This involves determining the time and budget allocations, the subset of the
organization that is to be modeled, the number of data sources selected, and the number and types of
departments to be served. Once a data warehouse is designed and constructed, the initial deployment of
the warehouse includes initial installation, roll-out planning, training, and orientation. Platform upgrades
and maintenance must also be considered. Data warehouse administration includes data refreshment, data
source synchronization, planning for disaster recovery, managing access control and security, managing
data growth, managing database performance, and data warehouse enhancement and extension. Scope
management includes controlling the number and range of queries, dimensions, and reports; limiting the
size of the data warehouse; or limiting the schedule, budget, or resources. Various kinds of data
warehouse design tools are available. Data warehouse development tools provide functions to define and
edit metadata repository contents (such as schemas, scripts, or rules), answer queries, output reports, and
ship metadata to and from relational database system catalogues. Planning and analysis tools study the
impact of schema changes and of refresh performance when changing refresh rates or time windows.
Data Warehousing: A Multitiered Architecture
Data warehouses often adopt a three-tier architecture:
1. The bottom tier is a warehouse database server that is almost always a relational database system. Back-end tools and utilities are used to feed data into the bottom tier from operational databases or other external sources.
2. The middle tier is an OLAP server that is typically implemented using either a relational OLAP (ROLAP) model or a multidimensional OLAP (MOLAP) model.
3. The top tier is a front-end client layer, which contains query and reporting tools, analysis tools, and/or data mining tools.
Data Warehouse Models: Enterprise Warehouse, Data Mart, and Virtual Warehouse
From the architecture point of view, there are three data warehouse models: the enterprise warehouse,
the data mart, and the virtual warehouse.
Enterprise warehouse: An enterprise warehouse collects all of the information about subjects
spanning the entire organization. It provides corporate-wide data integration, usually from one or more
operational systems or external information providers, and is cross-functional in scope. It typically
contains detailed data as well as summarized data, and can range in size from a few gigabytes to
hundreds of gigabytes, terabytes, or beyond. An enterprise data warehouse may be implemented on
traditional mainframes, computer superservers, or parallel architecture platforms. It requires extensive
business modeling and may take years to design and build.
Data mart: A data mart contains a subset of corporate-wide data that is of value to a specific group of
users. The scope is confined to specific selected subjects. For example, a marketing data mart may
confine its subjects to customer, item, and sales. The data contained in data marts tend to be
summarized.
Data marts are usually implemented on low-cost departmental servers that are Unix/Linux or Windows based. The implementation cycle of a data mart is more likely to be measured in weeks rather than
months or years. However, it may involve complex integration in the long run if its design and
planning were not enterprise-wide.
Data marts are of two types:
1. Independent data marts, which are sourced from data captured from one or more operational systems or external information providers, or from data generated locally within a particular department or geographic area.
2. Dependent data marts, which are sourced directly from enterprise data warehouses.
Virtual warehouse: A virtual warehouse is a set of views over operational databases. For efficient
query processing, only some of the possible summary views may be materialized. A virtual warehouse is easy to build but requires excess capacity on operational database servers.
“What are the pros and cons of the top-down and bottom-up approaches to data warehouse
development?”
A recommended method for the development of data warehouse systems is to implement the
warehouse in an incremental and evolutionary manner. First, a high-level corporate data model is
defined within a reasonably short period (such as one or two months) that provides a corporate-wide,
consistent, integrated view of data among different subjects and potential usages. Second, independent data marts can be implemented in parallel with the enterprise warehouse based on the same corporate data model set noted before. Third, distributed data marts can be constructed to integrate different data
marts via hub servers. Finally, a multitier data warehouse is constructed where the enterprise
warehouse is the sole custodian of all warehouse data, which is then distributed to the various
dependent data marts.
Extraction, Transformation, and Loading
Data warehouse systems use back-end tools and utilities to populate and refresh their data. These tools and utilities include the following functions:
Data extraction, which typically gathers data from multiple, heterogeneous, and external sources.
Data cleaning, which detects errors in the data and rectifies them when possible.
Data transformation, which converts data from legacy or host format to warehouse format.
Load, which sorts, summarizes, consolidates, computes views, checks integrity, and builds indices and partitions.
Refresh, which propagates the updates from the data sources to the warehouse.
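A toy end-to-end sketch of these functions is shown below; the source records, cleaning rule, and target format are all invented for illustration.

```python
# Toy ETL sketch: extract from two heterogeneous sources, clean, transform, load.
legacy_source = [{"ITEM": "tv", "PRICE": "499"}, {"ITEM": "dvd", "PRICE": None}]
flat_file = [{"item": "Camcorder", "price": 450.0}]

def extract():
    # Extraction: gather data from multiple, heterogeneous sources.
    return legacy_source + flat_file

def clean(records):
    # Cleaning: detect errors (here, missing prices) and drop or rectify them.
    return [r for r in records if r.get("PRICE") or r.get("price")]

def transform(records):
    # Transformation: convert from legacy/host formats to one warehouse format.
    out = []
    for r in records:
        item = (r.get("ITEM") or r.get("item")).lower()
        price = float(r.get("PRICE") or r.get("price"))
        out.append({"item": item, "price": price})
    return out

def load(records):
    # Load: sort, build a simple index on item name, and precompute a summary.
    records.sort(key=lambda r: r["item"])
    index = {r["item"]: r for r in records}
    total_sales = sum(r["price"] for r in records)
    return index, total_sales

warehouse_index, total_sales = load(transform(clean(extract())))
print(warehouse_index, total_sales)
```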
Metadata Repository
Metadata are data about data. When used in a data warehouse, metadata are the data that define warehouse objects. Metadata are created for the data names and definitions of the given warehouse.
Relational OLAP (ROLAP) servers: These are the intermediate servers that stand in between a
relational back-end server and client front-end tools. They use a relational or extended-relational DBMS
to store and manage warehouse data, and OLAP middleware to support missing pieces. ROLAP servers
include optimization for each DBMS back end, implementation of aggregation navigation logic, and
additional tools and services. ROLAP technology tends to have greater scalability than MOLAP
technology. The DSS server of Microstrategy, for example, adopts the ROLAP approach.
Multidimensional OLAP (MOLAP) servers: These servers support multidimensional views of data through array-based multidimensional storage engines. They map multidimensional views directly to data cube array structures. The advantage of using a data cube is that it allows fast indexing to precomputed summarized
data. Notice that with multidimensional data stores, the storage utilization may be low if the data set is
sparse. In such cases, sparse matrix compression techniques should be explored (Chapter 4). Many
MOLAP servers adopt a two-level storage representation to handle dense and sparse data sets: denser
subcubes are identified and stored as array structures, whereas sparse subcubes employ compression
technology for efficient storage utilization.
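A minimal sketch of the dense/sparse distinction follows (array shapes and values are invented): a dense subcube is held as a NumPy array for fast positional indexing, while a sparse subcube stores only its occupied cells, a simple stand-in for sparse-matrix compression rather than any vendor's actual format.

```python
import numpy as np

# Dense subcube: nearly every (city, item) cell holds a value, so an
# array gives fast positional indexing into precomputed aggregates.
dense_subcube = np.arange(1.0, 13.0).reshape(3, 4)   # 3 cities x 4 items
print(dense_subcube[2, 1])                           # O(1) cell lookup

# Sparse subcube: most cells are empty, so store only the occupied cells
# (a simple stand-in for sparse matrix compression techniques).
sparse_subcube = {(5, 7): 42.0, (900, 3): 17.5}      # {(city_idx, item_idx): sales}
print(len(sparse_subcube), "cells stored instead of", 1000 * 1000)
```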
Hybrid OLAP (HOLAP) servers: The hybrid OLAP approach combines ROLAP and MOLAP
technology, benefiting from the greater scalability of ROLAP and the faster computation of MOLAP.
For example, a HOLAP server may allow large volumes of detail data to be stored in a relational
database, while aggregations are kept in a separate MOLAP store. The Microsoft SQL Server 2000
supports a hybrid OLAP server.
Specialized SQL servers: To meet the growing demand of OLAP processing in relational databases,
some database system vendors implement specialized SQL servers that provide advanced query language
and query processing support for SQL queries over star and snowflake schemas in a read-only
environment.
“How are data actually stored in ROLAP and MOLAP architectures?” Let’s first look at ROLAP. As its
name implies, ROLAP uses relational tables to store data for on-line analytical processing. Recall that
the fact table associated with a base cuboid is referred to as a base fact table. The base fact table stores
data at the abstraction level indicated by the join keys in the schema for the given data cube. Aggregated
data can also be stored in fact tables, referred to as summary fact tables. Some summary fact tables store
both base fact table data and aggregated data, as in Example 3.10. Alternatively, separate summary fact
tables can be used for each level of abstraction, to store only aggregated data.
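Here is a small pandas sketch of the idea (column names and values are invented): the base fact table holds rows at the finest abstraction level, and a summary fact table stores the same measure aggregated to a coarser level so that common aggregate queries avoid scanning the base table.

```python
import pandas as pd

# Base fact table: one row per (day, item, city) at the cube's finest granularity.
base_fact = pd.DataFrame({
    "day":   ["2024-01-01", "2024-01-01", "2024-01-02"],
    "item":  ["TV", "TV", "Camcorder"],
    "city":  ["Toronto", "Vancouver", "Toronto"],
    "dollars_sold": [1200.0, 800.0, 450.0],
})

# Summary fact table: the same measure aggregated to the coarser (item, city) level,
# stored separately as precomputed aggregated data.
summary_fact = (base_fact
                .groupby(["item", "city"], as_index=False)["dollars_sold"]
                .sum())
print(summary_fact)
```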
Data Warehouse Modeling: Data Cube and OLAP
Data Cube: A Multidimensional Data Model
Stars, Snowflakes, and Fact Constellations: Schemas for Multidimensional Data Models
A data warehouse, however, requires a concise, subject-oriented schema that facilitates online data analysis. The most popular data model for a data warehouse is a multidimensional model, which can exist in the form of a star schema, a snowflake schema, or a fact constellation schema.
Star schema: The most common modeling paradigm is the star schema, in which the data warehouse contains (1) a large central table (fact table) containing the bulk of the data, with no redundancy, and (2) a set of smaller attendant tables (dimension tables), one for each dimension.
Example 4.1 Star schema. A star schema for AllElectronics sales is shown in Figure 4.6. Sales are considered along four dimensions: time, item, branch, and location. The schema contains a central fact table for sales with keys to each of the four dimensions, along with two measures: dollars_sold and units_sold.
Notice that in the star schema, each dimension is represented by only one table, and each table contains a set of attributes. For example, the location dimension table contains the attribute set {location_key, street, city, province_or_state, country}. This constraint may introduce some redundancy. For example, “Urbana” and “Chicago” are both cities in the state of Illinois, USA. Entries for such cities in the location dimension table will create redundancy among the attributes province_or_state and country; that is, (..., Urbana, IL, USA) and (..., Chicago, IL, USA). Moreover, the attributes within a dimension table may form either a hierarchy (total order) or a lattice (partial order).
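To make the layout concrete, here is a small sketch using an in-memory SQLite database: a central sales fact table holds foreign keys to the four dimension tables plus the measures dollars_sold and units_sold. The table layouts are simplified from Figure 4.6 (some attributes are omitted) and the inserted rows are invented.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Dimension tables: one per dimension, denormalized (hence some redundancy).
CREATE TABLE time_dim     (time_key INT PRIMARY KEY, day TEXT, month TEXT, quarter TEXT, year INT);
CREATE TABLE item_dim     (item_key INT PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT);
CREATE TABLE branch_dim   (branch_key INT PRIMARY KEY, branch_name TEXT, branch_type TEXT);
CREATE TABLE location_dim (location_key INT PRIMARY KEY, street TEXT, city TEXT,
                           province_or_state TEXT, country TEXT);
-- Central fact table: foreign keys to the four dimensions plus the numeric measures.
CREATE TABLE sales_fact   (time_key INT, item_key INT, branch_key INT, location_key INT,
                           dollars_sold REAL, units_sold INT);
""")
con.execute("INSERT INTO time_dim VALUES (1, '2024-01-01', 'Jan', 'Q1', 2024)")
con.execute("INSERT INTO item_dim VALUES (1, 'TV', 'BrandX', 'home entertainment')")
con.execute("INSERT INTO branch_dim VALUES (1, 'B1', 'retail')")
con.execute("INSERT INTO location_dim VALUES (1, '720 S. Main', 'Urbana',  'IL', 'USA')")
con.execute("INSERT INTO location_dim VALUES (2, '10 W. Lake',  'Chicago', 'IL', 'USA')")
con.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?, ?)",
                [(1, 1, 1, 1, 1200.0, 2), (1, 1, 1, 2, 800.0, 1)])

# A typical analytical query joins the fact table to its dimension tables.
for row in con.execute("""
    SELECT l.city, SUM(f.dollars_sold)
    FROM sales_fact f JOIN location_dim l ON f.location_key = l.location_key
    GROUP BY l.city"""):
    print(row)
```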
In data warehousing, there is a distinction between a data warehouse and a data mart. A data warehouse collects information about subjects that span the entire organization, such as customers, items, sales, assets, and personnel, and thus its scope is enterprise-wide. A data mart, in contrast, is a department subset of the data warehouse that focuses on selected subjects, and thus its scope is department-wide.
Dimensions: The Role of Concept Hierarchies
A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts. Consider a concept hierarchy for the dimension location. City values for location include Vancouver, Toronto, New York, and Chicago. Each city, however, can be mapped to the province or state to which it belongs. For example, Vancouver can be mapped to British Columbia, and Chicago to Illinois.
The provinces and states can in turn be mapped to the country (e.g., Canada or the United States) to which they belong. These mappings form a concept hierarchy for the dimension location, mapping a set of low-level concepts (i.e., cities) to higher-level, more general concepts (i.e., countries).
Many concept hierarchies are implicit within the database schema. For example, suppose that the dimension location is described by the attributes number, street, city, province_or_state, zip_code, and country. These attributes are related by a total order, forming a concept hierarchy such as “street < city < province_or_state < country.” This hierarchy is shown in Figure 4.10(a). Alternatively, the attributes of a dimension may be organized in a partial order, forming a lattice; for the time dimension, for example, “day < {month < quarter; week} < year.”
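One direct way to represent such a hierarchy is as a chain of mappings from each level to the next. The sketch below encodes the city, province_or_state, country example from the text; the Toronto and New York entries are filled in for completeness, and the function name is illustrative.

```python
# Concept hierarchy for the location dimension, expressed as explicit mappings.
city_to_province_or_state = {
    "Vancouver": "British Columbia",
    "Toronto": "Ontario",
    "New York": "New York",
    "Chicago": "Illinois",
}
province_or_state_to_country = {
    "British Columbia": "Canada",
    "Ontario": "Canada",
    "New York": "USA",
    "Illinois": "USA",
}

def generalize_city(city, level="country"):
    """Map a low-level value (city) to a higher-level, more general concept."""
    state = city_to_province_or_state[city]
    return state if level == "province_or_state" else province_or_state_to_country[state]

print(generalize_city("Chicago"))                         # Illinois -> USA
print(generalize_city("Vancouver", "province_or_state"))  # British Columbia
```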
Measures can be organized into three categories, distributive, algebraic, and holistic, based on the kind of aggregate functions used.
A measure is distributive if it is obtained by applying a distributive aggregate function, that is, one that can be computed by partitioning the data, aggregating each partition, and then aggregating the partial results to obtain the same answer as aggregating the whole data set. Common examples include count(), sum(), min(), and max().
A measure is algebraic if it is obtained by applying an algebraic aggregate function, one that can be computed from a bounded number M of distributive subaggregates. For example, avg() can be computed from sum() and count().
A measure is holistic if it is obtained by applying a holistic aggregate function.
An aggregate function is holistic if there is no constant bound on the storage size needed to describe a subaggregate. That is, there does not exist an algebraic function with M arguments (where M is a constant) that characterizes the computation. Common examples of holistic functions include median(), mode(), and rank().
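The distinction can be checked on a toy partitioned data set (the values below are invented): a distributive measure combines directly from per-partition results, an algebraic measure needs only a bounded number of distributive subaggregates, and a holistic measure such as the median generally cannot be recovered from per-partition summaries.

```python
import statistics

partitions = [[3, 7, 7], [1, 9], [4, 4, 10]]     # toy data split across three sites
all_values = [v for part in partitions for v in part]

# Distributive: the sum of per-partition sums equals the global sum.
assert sum(sum(p) for p in partitions) == sum(all_values)

# Algebraic: avg() is not distributive, but it is computable from a bounded
# number (M = 2) of distributive subaggregates: sum() and count().
total, count = sum(all_values), len(all_values)
assert total / count == statistics.mean(all_values)

# Holistic: median() has no constant-size subaggregate; combining the
# per-partition medians does not, in general, give the true median.
per_partition_medians = [statistics.median(p) for p in partitions]
print(statistics.median(per_partition_medians), "vs true median",
      statistics.median(all_values))
```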
Typical OLAP Operations
“How are concept hierarchies useful in OLAP?” In the multidimensional model, data are organized
into multiple dimensions, and each dimension contains multiple levels of abstraction defined by
concept hierarchies. This organization provides users with the flexibility to view data from different
perspectives.
Roll-up: The roll-up operation (also called the drill-up operation by some vendors) performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction.
For example, consider the concept hierarchy for location defined as the total order “street < city < province_or_state < country.” A roll-up on the location dimension aggregates the data by ascending this hierarchy from the level of city to the level of country.
When roll-up is performed by dimension reduction, one or more dimensions are removed from the given cube. For example, consider a sales data cube containing only the location and time dimensions. Roll-up may be performed by removing, say, the time dimension, resulting in an aggregation of the total sales by location, rather than by location and by time.
Drill-down: Drill-down is the reverse of roll-up. It navigates from less detailed data to more detailed data. Drill-down can be realized by either stepping down a concept hierarchy for a dimension or introducing additional dimensions. For example, consider a concept hierarchy for time defined as “day < month < quarter < year.” Drill-down occurs by descending the time hierarchy from the level of quarter to the more detailed level of month. Because a drill-down adds more detail to the given data, it can also be performed by adding new dimensions to a cube.
Slice and dice: The slice operation performs a selection on one dimension of the given cube, resulting
in a subcube. Figure 4.12 shows a slice operation where the sales data are selected from the central
cube for the dimension time using the criterion time = “Q1.” The dice operation defines a subcube by
performing a selection on two or more dimensions. Figure 4.12 shows a dice operation on the central
cube based on the following selection criteria that involve three dimensions: (location = “Toronto” or
“Vancouver”) and (time = “Q1” or “Q2”) and (item = “home entertainment” or “computer”).
Pivot (rotate): Pivot (also called rotate) is a visualization operation that rotates the data axes in view
to provide an alternative data presentation. Figure 4.12 shows a pivot operation where the item and
location axes in a 2-D slice are rotated.
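As an illustration only (the cube below is a tiny invented pandas data frame, not the AllElectronics cube of Figure 4.12), the four operations map naturally onto groupby, selection, and pivot operations.

```python
import pandas as pd

# Toy sales cube at (quarter, month, city, item) granularity; values are invented.
cube = pd.DataFrame({
    "quarter": ["Q1", "Q1", "Q1", "Q2"],
    "month":   ["Jan", "Feb", "Jan", "Apr"],
    "city":    ["Toronto", "Toronto", "Vancouver", "Toronto"],
    "item":    ["computer", "home entertainment", "computer", "computer"],
    "dollars_sold": [1000.0, 600.0, 400.0, 1200.0],
})

# Roll-up: climb the time hierarchy (month -> quarter).
rollup = cube.groupby(["quarter", "city"])["dollars_sold"].sum()

# Drill-down: descend again to the more detailed month level.
drilldown = cube.groupby(["quarter", "month", "city"])["dollars_sold"].sum()

# Slice: select on one dimension (time = "Q1").
slice_q1 = cube[cube["quarter"] == "Q1"]

# Dice: select on two or more dimensions.
dice = cube[cube["city"].isin(["Toronto", "Vancouver"])
            & cube["quarter"].isin(["Q1", "Q2"])
            & cube["item"].isin(["home entertainment", "computer"])]

# Pivot (rotate): swap the item and city axes of a 2-D view.
pivoted = slice_q1.pivot_table(index="item", columns="city",
                               values="dollars_sold", aggfunc="sum")
print(rollup, drilldown, pivoted, sep="\n\n")
```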
OLAP offers analytical modeling capabilities, including a calculation engine for deriving ratios,
variance, and so on, and for computing measures across multiple dimensions. OLAP also supports functional models for forecasting, trend analysis, and statistical analysis.
A Starnet Query Model for Querying Multidimensional Databases
Concept hierarchies can be used to generalize data by replacing low-level values (such as “day” for the
time dimension) by higher-level abstractions (such as “year”), or to specialize data by replacing
higher-level abstractions with lower-level values.