UNIT-3
What is a data warehouse?
A data warehouse is an informational environment that:
- Provides an integrated and total view of the enterprise
- Makes the enterprise's current and historical information easily available for decision making
- Makes decision-support transactions possible without hindering operational systems
- Renders the organization's information consistent
- Presents a flexible and interactive source of strategic information
The data sourcing, cleanup, extraction, transformation, and migration tools have to deal with some significant issues, including:
Database heterogeneity. DBMSs are very different in data models, data access language, data navigation, operations, concurrency, integrity, recovery, etc.
Data heterogeneity. This is the difference in the way data is defined and used in different models: homonyms, synonyms, unit incompatibility, different attributes for the same entity, and different ways of modeling the same fact.
Metadata
Metadata is data about data that describes the data warehouse. It is used for building, maintaining, managing, and using the data warehouse.
Metadata can be classified into:
Technical metadata, which contains information about warehouse data for use by warehouse designers and administrators when carrying out warehouse development and management tasks.
Business metadata, which contains information that gives users an easy-to-understand perspective of the information stored in the data warehouse.
Equally important, metadata provides users with interactive access to help them understand content and find data. One of the issues with metadata relates to the fact that many data extraction tools' metadata capabilities remain fairly immature. Therefore, there is often a need to create a metadata interface for users, which may involve some duplication of effort.
Metadata management is provided via a metadata repository and accompanying software. Metadata repository management software, which typically runs on a workstation, can be used to map source data to the target database, generate code for data transformations, integrate and transform the data, and control moving data to the warehouse.
The principal purpose of data warehousing is to provide information to business users for strategic decision-making. These users interact with the data warehouse using front-end tools. Tools fall into four main categories: query and reporting tools, application development tools, online analytical processing (OLAP) tools, and data mining tools.
Query and reporting tools can be divided into two groups: reporting tools and managed query tools. Reporting tools can be further divided into production reporting tools and report writers. Production reporting tools let companies generate regular operational reports, such as calculating and printing paychecks. Report writers, on the other hand, are inexpensive desktop tools designed for end users. Managed query tools shield end users from the complexities of SQL and database structures by inserting a metalayer between users and the database. These tools are designed for easy-to-use, point-and-click operations that either accept SQL or generate SQL database queries.
Often, the analytical needs of the data warehousing user community exceed the built-in capabilities of query and reporting tools. In these cases, organizations will often rely on the tried-and-true approach of in-house application development using graphical development environments such as PowerBuilder, Visual Basic, and Forté.
These application development platforms integrate well with popular OLAP tools and can access all major database systems, including Oracle, Sybase, and Informix. OLAP tools are based on the concepts of dimensional data models and corresponding databases, and they allow users to analyze the data using elaborate, multidimensional views. Typical business applications include product performance and profitability, effectiveness of a sales program or marketing campaign, sales forecasting, and capacity planning. These tools assume that the data is organized in a multidimensional model.
A critical success factor for any business today is the ability to use information effectively. Data mining is the process of discovering meaningful new correlations, patterns, and trends by digging into large amounts of data stored in the warehouse, using artificial intelligence, statistical, and mathematical techniques.
Data marts
A data mart is a data store that is subsidiary to a data warehouse of integrated data. The data mart is directed at a partition of data (often called a subject area) that is created for the use of a dedicated group of users. A data mart might, in fact, be a set of denormalized, summarized, or aggregated data. The data mart is a physically separate store of data and is resident on a separate database server, often on a local area network serving a dedicated user group.
Dependent data marts, in which data is sourced from the data warehouse, have a high value because, no matter how they are developed and how many different enabling technologies are used, different users are all accessing information views derived from a single integrated version of the data.
Unfortunately, misleading statements about the simplicity and low cost of data marts sometimes result in organizations or vendors incorrectly positioning them as an alternative to the data warehouse. This viewpoint defines independent data marts that, in fact, represent fragmented point solutions to a range of business problems in the enterprise. Such implementations are rarely developed in the context of an overall technology or application architecture. Moreover, the concept of an independent data mart is dangerous because, as soon as the first data mart is created, other organizations, groups, and subject areas within the enterprise embark on the task of building their own data marts. As a result, you create an environment where multiple operational systems feed multiple non-integrated data marts that often overlap in data content, job scheduling, connectivity, and management.
The need to manage this environment is obvious. Managing data warehouses includes security and priority management; monitoring updates from the multiple sources; data quality checks; managing and updating metadata; auditing and reporting data warehouse usage and status; purging data; replicating, subsetting, and distributing data; backup; and data warehouse storage management.
Delivery of information may be based on the time of day or on the completion of an external event. The rationale for the delivery-systems component is that once the data warehouse is installed and operational, its users do not simply view data at a specific point in time; information must be delivered to them on an ongoing basis.
In order to provide information to the wide community of data warehouse users, the information delivery component includes different methods of information delivery. Ad hoc reports are predefined reports primarily meant for novice and casual users. Provision for complex queries, multidimensional (MD) analysis, and statistical analysis caters to the needs of business analysts. Information fed into executive information systems (EIS) is meant for senior executives and high-level managers. Data mining applications help to discover trends and patterns from the usage of your data.
Snowflake schema: a refinement of a star schema in which some dimensional hierarchy is normalized into a set of smaller tables, forming a shape similar to a snowflake (see the sketch below).
Fact constellation: multiple fact tables share dimension tables; this can be viewed as a collection of stars, and is therefore also called a galaxy schema.
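To make the snowflake refinement concrete, here is a minimal sketch, as SQL DDL carried in Python strings; all table and column names are illustrative, not taken from any particular warehouse:

```python
# A hypothetical star-schema dimension: the whole location hierarchy is
# kept in one denormalized table.
star_location_dim = """
CREATE TABLE location_dim (
    location_key INT PRIMARY KEY,
    street   VARCHAR(50),
    city     VARCHAR(50),
    province VARCHAR(50),
    country  VARCHAR(50)
);
"""

# The snowflake refinement: the city/province/country hierarchy is
# normalized out into a smaller table referenced by a key.
snowflake_location_dims = """
CREATE TABLE city_dim (
    city_key INT PRIMARY KEY,
    city     VARCHAR(50),
    province VARCHAR(50),
    country  VARCHAR(50)
);
CREATE TABLE location_dim (
    location_key INT PRIMARY KEY,
    street   VARCHAR(50),
    city_key INT REFERENCES city_dim(city_key)
);
"""
```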
Pivot (rotate):
Reorients the cube for visualization, e.g., turning a 3D cube into a series of 2D planes.
Other operations
Drill within: switching from one classification to a different one within the same dimension.
Drill across: involves (across) more than one fact table.
Drill through: drills through the bottom level of the cube to its back-end relational tables (using SQL).
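As a rough sketch of how drill-across and drill-through could translate into SQL; all fact-table, dimension, and column names below are hypothetical:

```python
# Drill-across: combine aggregates from more than one fact table at a
# common grain (sales_fact and shipping_fact are illustrative names).
drill_across = """
SELECT s.year, s.total_sales, sh.total_shipping
FROM (SELECT d.year, SUM(f.dollars_sold) AS total_sales
      FROM sales_fact f JOIN time_dim d ON d.time_key = f.time_key
      GROUP BY d.year) s
JOIN (SELECT d.year, SUM(f.shipping_cost) AS total_shipping
      FROM shipping_fact f JOIN time_dim d ON d.time_key = f.time_key
      GROUP BY d.year) sh ON sh.year = s.year;
"""

# Drill-through: go past the bottom level of the cube to the detail rows
# in the back-end relational table.
drill_through = """
SELECT *
FROM sales_fact
WHERE time_key = 199901 AND item_key = 42;
"""
```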
Multi-Tiered Architecture
Data warehouses often adopt a three-tier architecture:
1. The bottom tier is a warehouse database server that is almost always a relational database system. Back-end tools and utilities are used to feed data into the bottom tier from operational databases or other external sources (such as customer profile information provided by external consultants). These tools and utilities perform data extraction, cleaning, and transformation. This tier also contains a metadata repository, which stores information about the data warehouse and its contents.
2. The middle tier is an OLAP server that is typically implemented using either (1) a relational OLAP (ROLAP) model, that is, an extended relational DBMS that maps operations on multidimensional data to standard relational operations; or (2) a multidimensional OLAP (MOLAP) model, that is, a special-purpose server that directly implements multidimensional data and operations.
3. The top tier is a front-end client layer, which contains query and reporting tools, analysis tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).
[Figure: Three-tier data warehousing architecture. Top tier: front-end tools. Middle tier: OLAP server. Bottom tier: data warehouse server and data marts, with monitoring/administration and a metadata repository; data is extracted, cleaned, transformed, loaded, and refreshed from operational databases and external sources.]
The partial materialization of cuboids or subcubes should consider three factors:
1. Identify the subset of cuboids or subcubes to materialize;
2. Exploit the materialized cuboids or subcubes during query processing;
3. Efficiently update the materialized cuboids or subcubes during load and refresh.
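For intuition about the space these choices come from, here is a minimal Python sketch: with n dimensions there are 2^n cuboids, and partial materialization picks a subset of them (the dimension names are illustrative):

```python
from itertools import combinations

# Enumerate every cuboid (group-by combination) of a hypothetical cube
# with dimensions item, city, and year: 2**3 = 8 cuboids in total.
dims = ("item", "city", "year")

cuboids = [combo
           for k in range(len(dims) + 1)
           for combo in combinations(dims, k)]

print(len(cuboids))          # 8, from the apex cuboid () to the base cuboid
for c in cuboids:
    print(c)                 # (), ('item',), ..., ('item', 'city', 'year')

# Partial materialization would keep only a chosen subset of these, e.g.
# the ones most frequently hit by queries.
```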
1.12.2 Indexing OLAP Data
To facilitate efficient data access, most data warehouse systems support index structures and materialized views (using cuboids).
Types of indexing OLAP data
The bitmap indexing method is popular in OLAP products because it allows quick searching in data cubes. The bitmap index is an alternative representation of the record ID (RID) list. In the bitmap index for a given attribute, there is a distinct bit vector Bv for each value v in the domain of the attribute. If the domain of a given attribute consists of n values, then n bits are needed for each entry in the bitmap index (i.e., there are n bit vectors). If the attribute has the value v for a given row in the data table, then the bit representing that value is set to 1 in the corresponding row of the bitmap index. All other bits for that row are set to 0.
Advantages
Useful for low-cardinality domains.
Reduction in space and processing time.
A base (data) table containing the dimensions item and city would map to one bitmap index table per dimension, as the sketch below illustrates.
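A minimal sketch of how such a bitmap index is built, using a hypothetical item column (the row values are made up for illustration):

```python
# Hypothetical base-table column for the low-cardinality dimension "item".
items = ["TV", "PC", "TV", "Phone", "PC"]

# n distinct values in the domain -> n bit vectors, one per value v.
domain = sorted(set(items))

# Bit vector B_v: bit i is 1 iff row i holds value v, otherwise 0.
bitmap = {v: [1 if x == v else 0 for x in items] for v in domain}

for v in domain:
    print(v, bitmap[v])
# PC    [0, 1, 0, 0, 1]
# Phone [0, 0, 0, 1, 0]
# TV    [1, 0, 1, 0, 0]
```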
The join indexing method gained popularity from its use in relational database query processing. Join indexing registers the joinable rows of two relations from a relational database. For example, if two relations R(RID, A) and S(B, SID) join on the attributes A and B, then the join index record contains the pair (RID, SID), where RID and SID are record identifiers from the R and S relations, respectively. Hence, the join index records can identify joinable tuples without performing costly join operations. Join indexing is useful for maintaining the relationship between a foreign key and its matching primary keys, from the joinable relations.
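A minimal sketch of the (RID, SID) pairs the text describes, using two tiny hypothetical relations:

```python
# Hypothetical relations R(RID, A) and S(B, SID) joining on A = B.
R = [(1, "TV"), (2, "PC"), (3, "TV")]        # (RID, A)
S = [("TV", 10), ("PC", 20), ("VCR", 30)]    # (B, SID)

# The join index stores (RID, SID) for every joinable pair, so later
# queries can find matching tuples without re-running the join.
join_index = [(rid, sid) for rid, a in R for b, sid in S if a == b]

print(join_index)    # [(1, 10), (2, 20), (3, 10)]
```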
The purpose of materializing cuboids and constructing OLAP index structures is to speed up query processing in data cubes. Given materialized views, query processing should proceed as follows:
1. Determine which operations should be performed on the available cuboids: this involves transforming any selection, projection, roll-up (group-by), and drill-down operations specified in the query into corresponding SQL and/or OLAP operations.
2. Determine to which materialized cuboid(s) the relevant operations should be applied.
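As a sketch of step 2, suppose the (item, city) cuboid has been materialized as a summary table (the table and column names below are hypothetical); a query grouped by item alone can then be rewritten to roll that cuboid up further instead of scanning the base fact table:

```python
# Original query against the (hypothetical) base fact table.
original_query = """
SELECT item, SUM(dollars_sold)
FROM sales_fact
GROUP BY item;
"""

# Rewritten to use the materialized (item, city) cuboid: the pre-computed
# partial sums are simply rolled up one more level.
rewritten_query = """
SELECT item, SUM(total_sold)
FROM sales_by_item_city
GROUP BY item;
"""
```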
1.12.4 Metadata Repository
Metadata is the data defining warehouse objects. It includes the following kinds:
- A description of the structure of the warehouse: schema, views, dimensions, hierarchies, derived data definitions, and data mart locations and contents.
- Operational metadata: data lineage (the history of migrated data and the transformation paths applied to it), currency of data (active, archived, or purged), and monitoring information (warehouse usage statistics, error reports, and audit trails).
- The algorithms used for summarization.
- The mapping from the operational environment to the data warehouse.
- Data related to system performance: warehouse schemas, views, and derived data definitions.
- Business metadata: business terms and definitions, ownership of data, and charging policies.
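One way to picture these kinds side by side is a minimal Python sketch; the field names and values are purely illustrative, not a standard repository layout:

```python
# A toy metadata-repository entry covering the kinds listed above.
metadata = {
    "structure": {
        "schema": "star",
        "dimensions": ["item", "city", "time"],
        "data_marts": {"sales_mart": "server-02"},
    },
    "operational": {
        "lineage": "src_orders -> staging -> sales_fact",
        "currency": "active",          # active, archived, or purged
        "usage_stats": {"queries_per_day": 1200},
    },
    "summarization": {"sales_fact": "SUM(dollars_sold) by item, city"},
    "mapping": {"src_orders.total": "sales_fact.dollars_sold"},
    "business": {"dollars_sold": "revenue net of returns, in USD"},
}
```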
Data warehouse systems use back-end tools and utilities to populate and refresh their data. These tools and utilities include the following functions:
Data extraction: gathers data from multiple, heterogeneous, and external sources.
Data cleaning: detects errors in the data and rectifies them when possible.
Data transformation: converts data from legacy or host format to warehouse format.
Data loading: sorts, summarizes, consolidates, computes views, checks integrity, and builds indices and partitions.
Refresh: propagates the updates from the data sources to the warehouse.
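A minimal end-to-end sketch of these functions, with each stage as a tiny Python function (the record layout and the cleaning rule are invented for illustration):

```python
# A minimal sketch of the back-end populate/refresh pipeline.
def extract(sources):
    # Gather rows from multiple, heterogeneous sources into one stream.
    return [row for src in sources for row in src]

def clean(rows):
    # Detect simple errors and drop rows that cannot be rectified.
    return [r for r in rows if r.get("amount") is not None]

def transform(rows):
    # Convert legacy/host format (strings) to the warehouse format (floats).
    return [{**r, "amount": float(r["amount"])} for r in rows]

def load(warehouse, rows):
    # Sort and append; index/partition building would also happen here.
    warehouse.extend(sorted(rows, key=lambda r: r["day"]))

warehouse = []
load(warehouse, transform(clean(extract([
    [{"day": 2, "amount": "10.5"}, {"day": 1, "amount": None}],
    [{"day": 1, "amount": "7.0"}],
]))))
print(warehouse)   # [{'day': 1, 'amount': 7.0}, {'day': 2, 'amount': 10.5}]
```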
1.13 From data warehousing to data mining
[Figure: An integrated OLAM and OLAP architecture. A graphical user interface/API sits on top; beneath it (Layer 3), an OLAM engine and an OLAP engine operate side by side over the data cube and database layers.]
MOLAP:
This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a
multi-dimensional cube. The storage is not in the relational database, but in
proprietary formats.
Advantages:
Excellent performance: MOLAP cubes are built for fast data retrieval, and are optimal for slicing and dicing operations.
Can perform complex calculations: all calculations have been pre-generated when the
cube is created. Hence, complex calculations are not only doable, but they return
quickly.
Disadvantages:
Limited in the amount of data it can handle: because all calculations are performed when the cube is built, it is not possible to include a large amount of data in the cube itself. This is not to say that the data in the cube cannot be derived from a large amount of data. Indeed, this is possible. But in this case, only summary-level information will be included in the cube itself.
Requires additional investment: cube technology is often proprietary and does not already exist in the organization. Therefore, to adopt MOLAP technology, chances are additional investments in human and capital resources are needed.
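To illustrate the "pre-generated when the cube is created" point above, here is a minimal sketch that aggregates every group-by combination at build time so that query time is a plain lookup (the data and dimension names are made up):

```python
from itertools import combinations

# Hypothetical base data and dimensions.
rows = [{"item": "TV", "city": "Oslo", "sales": 5},
        {"item": "TV", "city": "Bergen", "sales": 3},
        {"item": "PC", "city": "Oslo", "sales": 4}]
dims = ("item", "city")

# Build time: pre-aggregate every group-by combination into the cube.
cube = {}
for k in range(len(dims) + 1):
    for combo in combinations(dims, k):
        for r in rows:
            key = (combo, tuple(r[d] for d in combo))
            cube[key] = cube.get(key, 0) + r["sales"]

# Query time: "total sales per item" for TV is just a dictionary lookup.
print(cube[(("item",), ("TV",))])   # 8
```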
ROLAP:
This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a "WHERE" clause in the SQL statement, as the sketch below illustrates.
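A minimal sketch of that idea: the tool holds a base aggregate statement, and each slice just injects another WHERE predicate (the table and column names are hypothetical):

```python
# Base aggregate a ROLAP tool might generate for a simple report.
base = "SELECT city, SUM(dollars_sold) FROM sales_fact GROUP BY city"

def slice_on(sql: str, predicate: str) -> str:
    # Slicing/dicing = adding a WHERE predicate before the GROUP BY.
    return sql.replace("GROUP BY", f"WHERE {predicate} GROUP BY")

print(slice_on(base, "year = 2024"))
# SELECT city, SUM(dollars_sold) FROM sales_fact WHERE year = 2024 GROUP BY city
```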
Advantages:
Can handle large amounts of data: the data size limitation of ROLAP technology is the limitation on data size of the underlying relational database. In other words, ROLAP itself places no limitation on data amount.
Can leverage functionalities inherent in the relational database: often, the relational database already comes with a host of functionalities. ROLAP technologies, since they sit on top of the relational database, can therefore leverage these functionalities.
Disadvantages:
Performance can be slow: because ROLAP technology mainly relies on generating an SQL query (or multiple SQL queries) against the relational database, the query time can be long if the underlying data size is large.
Limited by SQL functionalities: because ROLAP technology mainly relies on generating SQL statements to query the relational database, and SQL statements do not fit all needs (for example, it is difficult to perform complex calculations using SQL), ROLAP technologies are therefore traditionally limited by what SQL can do. ROLAP vendors have mitigated this risk by building into the tool out-of-the-box complex functions as well as the ability to allow users to define their own functions.
HOLAP:
Hybrid online analytical processing (HOLAP) is a combination of relational OLAP (ROLAP) and multidimensional OLAP (usually referred to simply as OLAP). HOLAP was developed to combine the greater data capacity of ROLAP with the superior processing capability of OLAP.
HOLAP can use varying combinations of ROLAP and OLAP technology. Typically, it stores data in both a relational database (RDB) and a multidimensional database (MDDB) and uses whichever one is best suited to the type of processing desired. The databases are used to store data in the most functional way. For data-heavy processing, the data is more efficiently stored in an RDB, while for speculative processing, the data is more effectively stored in an MDDB.
HOLAP users can choose to store the results of queries in the MDDB to save the effort of looking for the same data over and over, which saves time. Although this technique, called "materializing cells," improves performance, it takes a toll on storage. The user has to strike a balance between performance and storage demand to get the most out of HOLAP. Nevertheless, because it offers the best features of both OLAP and ROLAP, HOLAP is increasingly preferred.
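A minimal sketch of the materializing-cells trade-off described above: cache cell values pulled from the relational store so that repeated requests become MDDB-style lookups, at the cost of extra storage (all names are illustrative):

```python
cell_cache = {}   # the "materialized cells" (the extra storage cost)

def cell_value(coords, compute_from_rdb):
    # First request: run the (slow) relational query and keep the result.
    if coords not in cell_cache:
        cell_cache[coords] = compute_from_rdb(coords)
    # Later requests: fast lookup, no repeated search for the same data.
    return cell_cache[coords]

# Usage: cell_value(("TV", "Oslo", 2024), lambda c: 8) -> 8, cached after.
```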