UNIT I
DATA WAREHOUSING
Data Warehousing Components – Building a Data Warehouse – Mapping the Data Warehouse
to a Multiprocessor Architecture – DBMS Schemas for Decision Support – Data Extraction,
Cleanup, and Transformation Tools – Metadata.
Queries that would be complex in very normalized databases could be easier to build
and maintain in data warehouses, decreasing the workload on transaction systems.
Data warehousing is an efficient way to manage and report on data that comes from a variety
of sources and is non-uniform and scattered throughout a company.
Data warehousing is an efficient way to manage demand for lots of information from
lots of users.
Data warehousing provides the capability to analyze large amounts of historical data for
nuggets of wisdom that can provide an organization with competitive advantage.
Informational Data:
Focusing on providing answers to problems posed by decision makers
Summarized
Non-updateable
Data Warehouse Characteristics
• A data warehouse can be viewed as an information system with the following attributes:
It is a database designed for analytical rather than transactional tasks
Its content is integrated from multiple operational applications and external sources
It contains both current and historical data to provide a historical perspective
Its content is read-mostly: data is periodically loaded but rarely updated in place
The data sources for a data warehouse are the operational applications. Data entered into
the data warehouse is transformed into an integrated structure and format. The transformation
process involves conversion, summarization, filtering and condensation. The data warehouse
must be capable of holding and managing large volumes of data, as well as data structures of
different kinds, over time.
1. Data warehouse database
This is the central part of the data warehousing environment (item number 2 in the
architecture diagram). It is implemented based on RDBMS technology.
2. Sourcing, Acquisition, Cleanup, and Transformation Tools
These are item number 1 in the architecture diagram. They perform conversions,
summarization, key changes, structural changes and condensation. The data transformation is
required so that the information can be used by decision support tools. The transformation
produces programs, control statements, JCL code, COBOL code, UNIX scripts, SQL DDL code,
etc., to move the data into the data warehouse from multiple operational systems.
The functionalities of these tools are listed below (a simple SQL sketch follows the list):
To remove unwanted data from operational databases
Converting to common data names and attributes
Calculating summaries and derived data
Establishing defaults for missing data
Accommodating source data definition changes
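As a rough illustration only, the following SQL sketch shows the kind of statement such a
transformation step might produce. The table and column names (ops_orders, ops_customers,
dw_sales and so on) are assumptions made for the example, not part of any particular tool.

-- Hypothetical cleanup/transformation step (illustrative names only):
-- move operational order data into a warehouse sales table.
INSERT INTO dw_sales (customer_name, region_code, sale_month, total_amount)
SELECT
    UPPER(TRIM(c.cust_nm)),             -- convert to a common name format
    COALESCE(c.region_cd, 'UNK'),       -- establish a default for missing data
    TO_CHAR(o.order_dt, 'YYYY-MM'),     -- derive the reporting month
    SUM(o.order_amt)                    -- calculate summary data
FROM ops_orders o
JOIN ops_customers c ON c.cust_id = o.cust_id
WHERE o.status <> 'CANCELLED'           -- remove unwanted operational data
GROUP BY UPPER(TRIM(c.cust_nm)), COALESCE(c.region_cd, 'UNK'),
         TO_CHAR(o.order_dt, 'YYYY-MM');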
3. Metadata
It is data about data. It is used for maintaining, managing and using the data warehouse. It
is classified into two types:
1. Technical Metadata: It contains information about warehouse data used by warehouse
designers and administrators to carry out development and management tasks. It includes:
Information about data stores
Transformation descriptions, i.e. the mapping methods from operational databases to the warehouse database
Warehouse object and data structure definitions for target data
The rules used to perform cleanup and data enhancement
Data mapping operations
Access authorization, backup history, archive history, information delivery history, data acquisition
history, data access, etc.
2. Business Metadata: It contains information that describes, in business terms, the data stored in
the data warehouse for end users. It includes:
Subject areas and information object types, including queries, reports, images, video and audio
clips, etc.
Internet home pages
Information related to the information delivery system
Data warehouse operational information such as ownerships, audit trails, etc.
Metadata helps users to understand the content and find the data. Metadata is stored in a
separate data store known as the information directory or metadata repository, which
helps to integrate, maintain and view the contents of the data warehouse.
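As one possible sketch of what such a repository might hold, a single mapping table could record
both technical and business metadata for each warehouse column; every name below is an
assumption for the example, not a standard repository design.

-- Illustrative metadata repository table (hypothetical design).
CREATE TABLE md_column_mapping (
    mapping_id       INTEGER PRIMARY KEY,
    source_system    VARCHAR(30),    -- operational system supplying the data
    source_table     VARCHAR(60),
    source_column    VARCHAR(60),
    target_table     VARCHAR(60),    -- warehouse object holding the data
    target_column    VARCHAR(60),
    transform_rule   VARCHAR(200),   -- cleanup / enhancement rule applied
    business_name    VARCHAR(100),   -- business metadata: user-friendly name
    business_meaning VARCHAR(400)    -- business metadata: definition for end users
);

-- Collecting one mapping entry.
INSERT INTO md_column_mapping VALUES
  (1, 'ORDERS', 'OPS_ORDERS', 'ORDER_AMT', 'DW_SALES', 'TOTAL_AMOUNT',
   'Summed by customer and month', 'Monthly sales amount',
   'Total value of non-cancelled orders per customer per month');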
The following lists the characteristics of info directory/ Meta data:
It is the gateway to the data warehouse environment
It supports easy distribution and replication of content for high performance and
availability
It should be searchable by business oriented key words
It should act as a launch platform for end user to access data and analysis tools
It should support the sharing of info
It should support scheduling options for requests
It should support and provide interfaces to other applications
It should support end user monitoring of the status of the data warehouse environment
4. Access tools
Its purpose is to provide info to business users for decision making. There are five
categories:
Data query and reporting tools
Application development tools
Executive info system tools (EIS)
OLAP tools
Data mining tools
Query and reporting tools are used to generate queries and reports. There are two types of
reporting tools: production reporting tools, which are used to generate regular operational
reports, and desktop report writers, which are inexpensive tools designed for end users.
Top Down Approach
The top down approach, suggested by Bill Inmon, builds a centralized enterprise data
warehouse (EDW) first and then derives data marts from it for individual subject areas.
The disadvantage of using the Top Down approach is that it requires more time and a larger
initial investment. The business has to wait for the EDW to be implemented and the data marts to
be built on top of it before it can access its reports.
Bottom Up Approach
The bottom up approach suggested by Ralph Kimball is an incremental approach to build
a data warehouse. Here we build the data marts separately at different points of time as and when
the specific subject area requirements are clear. The data marts are integrated or combined
together to form a data warehouse. Separate data marts are combined through the use of
conformed dimensions and conformed facts. A conformed dimension and a conformed fact are ones
that can be shared across data marts.
A conformed dimension has consistent dimension keys, consistent attribute names and
consistent values across separate data marts. A conformed dimension means exactly the same thing
with every fact table it is joined to.
A Conformed fact has the same definition of measures, same dimensions joined to it and
at the same granularity across data marts.
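The following SQL sketch illustrates the idea with a hypothetical date dimension shared by a
sales mart and a shipments mart; all table and column names are assumptions for the example.

-- A conformed date dimension shared by two data marts (illustrative names).
CREATE TABLE dim_date (
    date_key      INTEGER PRIMARY KEY,  -- same surrogate keys in every mart
    calendar_date DATE,
    month_name    VARCHAR(10),          -- consistent attribute names and values
    year_no       INTEGER
);

CREATE TABLE fact_sales (
    date_key     INTEGER REFERENCES dim_date (date_key),
    product_key  INTEGER,
    sales_amount DECIMAL(12,2)          -- measure at daily grain
);

CREATE TABLE fact_shipments (
    date_key      INTEGER REFERENCES dim_date (date_key),
    product_key   INTEGER,
    units_shipped INTEGER               -- same grain, same conformed dimension
);

-- Drill-across: aggregate each fact separately, then join the results on the
-- conformed dimension attribute so the two marts can be compared side by side.
SELECT sa.year_no, sa.sales, sh.shipped
FROM (SELECT d.year_no, SUM(f.sales_amount) AS sales
      FROM fact_sales f JOIN dim_date d ON d.date_key = f.date_key
      GROUP BY d.year_no) sa
JOIN (SELECT d.year_no, SUM(f.units_shipped) AS shipped
      FROM fact_shipments f JOIN dim_date d ON d.date_key = f.date_key
      GROUP BY d.year_no) sh
  ON sh.year_no = sa.year_no;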
The bottom up approach helps us incrementally build the warehouse by developing and
integrating data marts as and when the requirements are clear. We do not have to wait until the
overall requirements of the warehouse are known.
We should implement the bottom up approach when
1. We have initial cost and time constraints.
2. The complete warehouse requirements are not clear and we have clarity on only one data mart.
The advantage of using the Bottom Up approach is that it does not require a high initial cost and
has a faster implementation time; hence the business can start using the marts much earlier
compared to the top-down approach.
The disadvantages of using the Bottom Up approach are that it stores data in a de-normalized
format, so there is high space usage for detailed data, and there is a tendency not to keep detailed
data in this approach, losing the advantage of having detail data, i.e. the flexibility to easily cater
to future requirements. The bottom up approach is more realistic, but the complexity of the
integration may become a serious obstacle.
Design considerations
To be successful, a data warehouse designer must adopt a holistic approach: consider all
data warehouse components as parts of a single complex system, and take into account all
possible data sources and all known usage requirements.
Most successful data warehouses that meet these requirements have these common
characteristics:
Are based on a dimensional model
Contain historical and current data
Include both detailed and summarized data
Consolidate disparate data from multiple sources while retaining consistency
A data warehouse is difficult to build due to the following reasons:
Heterogeneity of data sources
Use of historical data
Growing nature of the database
The data warehouse design approach must be a business-driven, continuous and iterative
engineering approach. In addition to the general considerations, the following specific points are
relevant to data warehouse design:
Data content
The content and structure of the data warehouse are reflected in its data model. The data
model is the template that describes how information will be organized within the integrated
warehouse framework. The data warehouse data must be detailed data. It must be formatted,
cleaned up and transformed to fit the warehouse data model.
Meta data
It defines the location and contents of data in the warehouse. Meta data is searchable by
users to find definitions or subject areas. In other words, it must provide decision support
oriented pointers to warehouse data and thus provides a logical link between warehouse data and
decision support applications.
Data distribution
One of the biggest challenges when designing a data warehouse is the data placement and
distribution strategy. Data volumes continue to grow. Therefore, it becomes necessary
to know how the data should be divided across multiple servers and which users should get
access to which types of data. The data can be distributed based on the subject area, location
(geographical region), or time (current, month, year).
Tools
A number of tools are available that are specifically designed to help in the
implementation of the data warehouse. All selected tools must be compatible with the given data
warehouse environment and with each other. All tools must be able to use a common Meta data
repository.
Design steps
The following nine-step method is followed in the design of a data warehouse:
1. Choosing the subject matter
2. Deciding what a fact table represents
3. Identifying and conforming the dimensions
4. Choosing the facts
5. Storing pre-calculations in the fact table
6. Rounding out the dimension tables
7. Choosing the duration of the database
8. The need to track slowly changing dimensions
9. Deciding the query priorities and query models
Technical considerations
A number of technical issues are to be considered when designing a data warehouse
environment. These issues include:
The hardware platform that would house the data warehouse
The DBMS that supports the warehouse data
The communication infrastructure that connects data marts, operational systems and end
users
The hardware and software to support the metadata repository
The systems management framework that enables administration of the entire environment
Implementation considerations
The following logical steps are needed to implement a data warehouse:
Collect and analyze business requirements
Create a data model and a physical design
Define data sources
Choose the database technology and platform
Extract the data from the operational databases, transform it, clean it up and load it into the
warehouse
Choose database access and reporting tools
Choose database connectivity software
Choose data analysis and presentation software
Update the data warehouse
Access tools
Data warehouse implementation relies on selecting suitable data access tools. The best
way to choose them is based on the type of data that can be accessed using the tool and the kind
of access it permits for a particular user. The following lists the various types of data that can be
accessed:
Simple tabular form data
Ranking data
Multivariable data
Time series data
Graphing, charting and pivoting data
Complex textual search data
Statistical analysis data
User levels
The users of data warehouse data can be classified on the basis of their skill level in
accessing the warehouse.
There are three classes of users:
Casual users: are most comfortable retrieving information from the warehouse in predefined
formats and running pre-existing queries and reports. These users do not need tools that allow for
building standard and ad hoc reports.
Power users: can use predefined as well as user-defined queries to create simple and ad hoc
reports. These users can engage in drill-down operations. These users may have experience of
using reporting and query tools.
Expert users: These users tend to create their own complex queries and perform standard
analysis on the information they retrieve. These users are knowledgeable about the use of query
and report tools.
Mapping the Data Warehouse to a Multiprocessor Architecture
Data partitioning:
Data partitioning is the key component for effective parallel execution of database
operations. Partitioning can be done randomly or intelligently.
Random partitioning includes random data striping across multiple disks on a single server.
Another option for random partitioning is round-robin partitioning, in which each record is
placed on the next disk assigned to the database.
Intelligent partitioning assumes that the DBMS knows where a specific record is located and does
not waste time searching for it across all disks.
The various intelligent partitioning methods include (a SQL sketch follows the list):
Hash partitioning: A hash algorithm is used to calculate the partition number based on the value
of the partitioning key for each row.
Key range partitioning: Rows are placed and located in the partitions according to the value of
the partitioning key. That is all the rows with the key value from A to K are in partition 1, L to T
are in partition 2 and so on.
Schema partitioning: an entire table is placed on one disk; another table is placed on a different
disk, etc. This is useful for small reference tables.
User-defined partitioning: allows a table to be partitioned on the basis of a user-defined
expression.
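As an illustration, the following Oracle-style DDL sketches hash and key range partitioning; the
table and column names are assumptions, and the exact clauses vary from one DBMS to another.

-- Hash partitioning (illustrative): rows are spread over four partitions by a
-- hash of the partitioning key.
CREATE TABLE fact_sales_hash (
    sale_id     INTEGER,
    customer_id INTEGER,
    sale_date   DATE,
    amount      NUMBER(12,2)
)
PARTITION BY HASH (customer_id) PARTITIONS 4;

-- Key range partitioning (illustrative): rows are placed according to the
-- value of the partitioning key.
CREATE TABLE fact_sales_range (
    sale_id     INTEGER,
    customer_id INTEGER,
    sale_date   DATE,
    amount      NUMBER(12,2)
)
PARTITION BY RANGE (sale_date) (
    PARTITION p_2022 VALUES LESS THAN (DATE '2023-01-01'),
    PARTITION p_2023 VALUES LESS THAN (DATE '2024-01-01'),
    PARTITION p_max  VALUES LESS THAN (MAXVALUE)
);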
Shared disk architecture
A shared disk cluster is composed of multiple loosely coupled nodes. A
Distributed Lock Manager (DLM) is required. Examples of loosely coupled systems are VAX
clusters or Sun clusters.
Since the memory is not shared among the nodes, each node has its own data cache. Cache
consistency must be maintained across the nodes and a lock manager is needed to maintain the
consistency. Additionally, instance locks using the DLM on the Oracle level must be maintained
to ensure that all nodes in the cluster see identical data.
There is additional overhead in maintaining the locks and ensuring that the data caches are
consistent. The performance impact is dependent on the hardware and software components, such
as the bandwidth of the high-speed bus through which the nodes communicate, and DLM
performance.
Parallel processing advantages of shared disk systems are as follows:
Shared disk systems permit high availability. All data is accessible even if one node
dies.
These systems have the concept of one database, which is an advantage over shared
nothing systems.
Shared nothing architecture
Shared nothing systems are concerned with access to disks, not access to memory.
Nonetheless, adding more PUs and disks can improve scale up. Oracle Parallel Server can access
the disks on a shared nothing system as long as the operating system provides transparent disk
access, but this access is expensive in terms of latency.
DBMS Schemas for Decision Support
The following schema models are used to represent multidimensional data in a relational
decision support database:
1. Star schema
2. Snowflake schema
3. Fact constellation schema
Star schema
The multidimensional view of data that is expressed using relational database semantics
is provided by the database schema design called the star schema. The basic premise of the star
schema is that information can be classified into two groups:
Facts
Dimensions
Star schema has one large central table (fact table) and a set of smaller tables (dimensions)
arranged in a radial pattern around the central table.
Facts are the core data elements being analyzed, while dimensions are attributes about the facts.
The determination of which schema model should be used for a data warehouse should be based
upon the analysis of project requirements, accessible tools and project team preferences.
The star schema architecture is the simplest data warehouse schema. It is called a star
schema because the diagram resembles a star, with points radiating from a center. The center of
the star consists of the fact table, and the points of the star are the dimension tables. Usually the
fact tables in a star schema are in third normal form (3NF), whereas dimensional tables are de-
normalized. Despite the fact that the star schema is the simplest architecture, it is the most
commonly used nowadays and is recommended by Oracle.
Fact Tables
A fact table is a table that contains summarized numerical and historical data (facts) and a
multipart index composed of foreign keys from the primary keys of related dimension tables. A
fact table typically has two types of columns: foreign keys to dimension tables, and measures,
which contain numeric facts. A fact table can contain facts at a detailed or aggregated level.
Dimension Tables
Dimensions are categories by which summarized data can be viewed. For example, a profit
summary in a fact table can be viewed by a Time dimension (profit by month, quarter, year), a
Region dimension (profit by country, state, city) and a Product dimension (profit for product1,
product2). A dimension is a structure usually composed of one or more hierarchies that categorize
data. If a dimension has no hierarchies and levels, it is called a flat dimension or list. The primary
keys of each of the dimension tables are part of the composite primary key of the fact table.
Dimensional attributes help to describe the dimensional values. They are normally
descriptive, textual values. Dimension tables are generally smaller in size than fact tables. Typical
fact tables store data about sales, while dimension tables store data about geographic regions
(markets, cities), clients, products, times and channels.
Measures
Measures are numeric data based on columns in a fact table. They are the primary data
in which end users are interested. For example, a sales fact table may contain a profit measure
which represents the profit on each sale.
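The following DDL sketch shows a minimal star schema; the tables and columns (dim_time,
dim_product, dim_region, fact_profit) are illustrative assumptions, not a prescribed design.

-- Dimension tables (illustrative): small, de-normalized, descriptive attributes.
CREATE TABLE dim_time (
    time_key   INTEGER PRIMARY KEY,
    month_name VARCHAR(10),
    quarter_no INTEGER,
    year_no    INTEGER
);

CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name VARCHAR(60),
    category     VARCHAR(40)
);

CREATE TABLE dim_region (
    region_key INTEGER PRIMARY KEY,
    city       VARCHAR(40),
    state_name VARCHAR(40),
    country    VARCHAR(40)
);

-- Fact table (illustrative): foreign keys to the dimensions form a composite
-- primary key; profit is the numeric measure.
CREATE TABLE fact_profit (
    time_key    INTEGER REFERENCES dim_time (time_key),
    product_key INTEGER REFERENCES dim_product (product_key),
    region_key  INTEGER REFERENCES dim_region (region_key),
    profit      DECIMAL(12,2),
    PRIMARY KEY (time_key, product_key, region_key)
);

-- Profit by year and country: join the fact table to its dimensions and aggregate.
SELECT t.year_no, r.country, SUM(f.profit) AS total_profit
FROM fact_profit f
JOIN dim_time   t ON t.time_key   = f.time_key
JOIN dim_region r ON r.region_key = f.region_key
GROUP BY t.year_no, r.country;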
Aggregations are pre-calculated numeric data. By calculating and storing the answers to a
query before users ask for them, query processing time can be reduced. This is key to providing
fast query performance in OLAP.
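For instance, using the illustrative tables from the star schema sketch above, a common
aggregation could be pre-calculated and stored once so that later queries read it directly.

-- Pre-calculated aggregate (illustrative): profit by product and month,
-- stored once and reused instead of recomputing from the detail fact table.
CREATE TABLE agg_profit_product_month AS
SELECT f.product_key, t.year_no, t.month_name, SUM(f.profit) AS profit
FROM fact_profit f
JOIN dim_time t ON t.time_key = f.time_key
GROUP BY f.product_key, t.year_no, t.month_name;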
Cubes are data processing units composed of fact tables and dimensions from the data
warehouse. They provide multidimensional views of data, querying and analytical capabilities to
clients.
The main characteristics of the star schema:
Simple structure -> easy-to-understand schema
Great query effectiveness -> small number of tables to join
Relatively long time to load data into dimension tables -> de-normalization and
redundancy mean the tables can be large.
Most commonly used in data warehouse implementations -> widely supported
by a large number of business intelligence tools
Snowflake schema: is the result of decomposing one or more of the dimensions. The many-to-
one relationships among sets of attributes of a dimension can be separated into new dimension
tables, forming a hierarchy. The decomposed snowflake structure visualizes the hierarchical
structure of the dimensions very well.
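For example, the product dimension from the star schema sketch above could be snowflaked by
moving its many-to-one product-to-category relationship into a separate table; the names remain
illustrative.

-- Snowflaked product dimension (illustrative): the category level becomes
-- its own table, forming an explicit hierarchy.
CREATE TABLE dim_category (
    category_key  INTEGER PRIMARY KEY,
    category_name VARCHAR(40)
);

CREATE TABLE dim_product_sf (
    product_key  INTEGER PRIMARY KEY,
    product_name VARCHAR(60),
    category_key INTEGER REFERENCES dim_category (category_key)
);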
Fact constellation schema: For each star schema it is possible to construct a fact constellation
schema (for example, by splitting the original star schema into more star schemas, each of which
describes facts at another level of the dimension hierarchies). The fact constellation architecture
contains multiple fact tables that share many dimension tables.
The main shortcoming of the fact constellation schema is a more complicated design
because many variants for particular kinds of aggregation must be considered and selected.
Moreover, dimension tables are still large.
Data Extraction, Cleanup, and Transformation Tools:
Database Data Replication Tools:
– These tools capture changes from source databases and apply them to a target; they are
suitable when the number of data sources involved is small and a limited amount of data
transformation and enhancement is required.
Rule-driven Dynamic Transformation Engines (Data Mart Builders):
– They are also known as Data Mart Builders and capture data from a source system
at user-defined intervals, transform the data, and then send and load the results into a
target environment, typically a data mart.
– To date, most of the products in this category support only relational data sources,
though this trend has now started to change.
– Data to be captured from the source system is usually defined using query language
statements, and data transformation and enhancement is based on script or
function logic defined to the tool.
– With most tools in this category, data flows from source systems to target systems
through one or more servers, which perform the data transformation and
enhancement. These transformation servers can usually be controlled from a single
location, making the management of such an environment much easier.
Meta Data:
Meta Data Definitions:
Metadata – additional data about the warehouse, used to understand what information is in the
warehouse and what it means.
Metadata Repository – a specialized database designed to maintain metadata, together with
the tools and interfaces that allow a company to collect and distribute its metadata.
Operational Data – elements from operational systems, external data (or other sources)
mapped to the warehouse structures.
Industry trend:
Why were early Data Warehouses that did not include significant amounts of metadata collection
able to succeed?
• Usually a subset of data was targeted, making it easier to understand content, organization,
ownership.
• Usually targeted a subset of (technically inclined) end users
Early choices were made to ensure the success of initial data warehouse efforts.
Meta Data Transitions:
Usually, metadata repositories are already in existence. Traditionally, metadata was
aimed at overall systems management, such as aiding in the maintenance of legacy
systems through impact analysis, and determining the appropriate reuse of legacy data
structures.
Repositories can now aid in tracking metadata to help all data warehouse users
understand what information is in the warehouse and what it means. Tools are now being
positioned to help manage and maintain metadata.
Meta Data Lifecycle:
1. Collection: Identify metadata and capture it in a central repository.
2. Maintenance: Put in place processes to synchronize metadata automatically with the
changing data architecture.
3. Deployment: Provide metadata to users in the right form and with the right tools.
The key to ensuring a high level of collection and maintenance accuracy is to incorporate as
much automation as possible. The key to a successful metadata deployment is to correctly
match the metadata offered to the specific needs of each audience.
Meta Data Collection:
• Collecting the right metadata at the right time is the basis for success. If the user does
not already have an idea about what information would answer a question, the user will
not find anything helpful in the warehouse.
• Metadata spans many domains from physical structure data, to logical model data, to
business usage and rules.
• Typically the metadata that should be collected is already generated and processed by the
development team anyway. Metadata collection preserves the analysis performed by the
team.
This information is captured after the warehouse has been deployed. Typically, this information
is not easy to collect.
Maintaining Meta Data:
• As with any maintenance process, automation is key to maintaining current high-quality
information. The data warehouse tools can play an important role in how the metadata is
maintained.
• Most proposed database changes already go through appropriate verification and
authorization, so adding a metadata maintenance requirement should not be significant.
• Capturing incremental changes is encouraged since metadata (particularly structure
information) is usually very large.
Maintaining the Warehouse:
The warehouse team must have comprehensive impact analysis capabilities to respond to
change that may affect:
• Data extraction/movement/transformation routines
• Table structures
• Data marts and summary data structures
• Stored user queries
• Users who require new training (due to query or other changes)
For each element that is changing, it also helps to know what business problems are addressed
in part using that element (this helps in understanding the significance of the change and how it
may impact decision making). A sketch of such an impact-analysis query follows.
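As a rough sketch, if the source-to-target mappings are held in a repository table like the
hypothetical md_column_mapping shown earlier, an impact query can list every warehouse
column fed by a source column that is about to change.

-- Illustrative impact analysis: which warehouse columns and rules depend on
-- a source column that is about to change?
SELECT target_table, target_column, transform_rule
FROM md_column_mapping
WHERE source_system = 'ORDERS'
  AND source_table  = 'OPS_ORDERS'
  AND source_column = 'ORDER_AMT';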
Meta Data Deployment:
Supply the right metadata to the right audience
• Warehouse developers will primarily need the physical structure information for data
sources. Further analysis on that metadata leads to the development of more metadata
(mappings).
• Warehouse maintainers typically require direct access to the metadata as well.
• End Users require an easy-to-access format. They should not be burdened with technical
names or cryptic commands. Training, documentation and other forms of help, should be
readily available.
End Users:
Users of the warehouse are primarily concerned with two types of metadata.
1. A high-level topic inventory of the warehouse (what is in the warehouse and
where it came from).
2. Existing queries that are pertinent to their search (reuse).
The important goal is that the user is easily able to correctly find and interpret the data
they need.
Integration with Data Access Tools:
1. Side by Side access to metadata and to real data. The user can browse
metadata and write queries against the real data.
2. Populate query tool help text with metadata exported from the repository. The
tool can now provide the user with context-sensitive help, at the expense of
needing an update whenever the metadata changes; the user may therefore be
working with outdated metadata.
3. Provide query tools that access the metadata directly to provide context-
sensitive help (see the sketch after this list). This eliminates the refresh issue and
ensures the user always sees current metadata.
4. Full interconnectivity between query tool and metadata tool (transparent
transactions between tools).
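As a sketch of option 3, a query tool could read the business definition of a column directly from
the hypothetical md_column_mapping table shown earlier and display it as context-sensitive help.

-- Illustrative direct metadata access by a query tool.
SELECT business_name, business_meaning
FROM md_column_mapping
WHERE target_table  = 'DW_SALES'
  AND target_column = 'TOTAL_AMOUNT';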