Data Warehousing
Data Warehouse
A data warehouse is a collection of data that supports decision-making processes. It is an integrated collection of databases rather than a single database, and it should be regarded as the single source of information for all decision support processing and all informational applications throughout the organization. It provides the following features:
It is subject-oriented.
It is integrated and consistent.
It shows its evolution over time and it is not volatile.
It is used to support:
management decision-making processes;
business intelligence;
surfacing the information and knowledge needed to manage the organization effectively;
investigation of the key challenges and research directions for this discipline.
It comprises data that belongs to different information subject areas.
It contains different categories of data.
The data warehouse carries out the process of accessing heterogeneous data, cleansing and transforming it, and storing it in a structure that is easy to access, understand, and use. This data is finally used for report generation, querying, and data analysis.
Warehouse catalog
The warehouse catalog is the subsystem that stores and manages all the metadata.
The metadata refers to such information as data element mapping from source to
target, data element meaning information for information systems and business
users, the data models (both logical and physical), a description of the use of the
data, and temporal information.
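To make the role of the catalog concrete, the following is a minimal sketch of what one catalog record might hold. The element names and fields (`source`, `meaning`, `valid_from`, and so on) are hypothetical, chosen only to mirror the kinds of metadata listed above: source-to-target mapping, business meaning, model information, and temporal information.

```python
# Hypothetical warehouse catalog: one record per warehouse data element.
catalog = {
    "dw.sales.revenue": {
        "source": "erp.orders.line_total",               # source-to-target mapping
        "meaning": "Gross revenue per order line, before tax",  # business meaning
        "model": "fact_sales (logical) / FACT_SALES (physical)",
        "valid_from": "2023-01-01",                      # temporal information
    },
}

def describe(element):
    """Return the business meaning recorded for a warehouse data element."""
    entry = catalog.get(element)
    return entry["meaning"] if entry else "unknown element"
```

A business user (or a tool) can then look up what a warehouse column actually means, e.g. `describe("dw.sales.revenue")`.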
Two Layer
The requirement for separation plays a fundamental role in defining the typical architecture for a data warehouse system. Although it is commonly called a two-layer architecture, to highlight the separation between the physically available sources and the data warehouse, it consists of the following layers:
1) Source layer
Legacy databases
Information systems outside the corporate walls
2) Data Staging
The data stored to sources should be extracted, cleansed to remove inconsistencies
and fill gaps, and integrated to merge heterogeneous sources into one common
schema. The so called Extraction, Transformation, and Loading tools (ETL) can
merge heterogeneous schemata, extract, transform, cleanse, validate, filter, and load
source data into a data warehouse. ETL takes place once when a data warehouse is
populated for the first time, then it occurs every time the data warehouse is regularly
updated ETL consists of four separate phases: extraction (or capture), cleansing (or
cleaning or scrubbing), transformation, and loading.
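The four phases above can be sketched as a tiny in-memory pipeline. This is only an illustration under simplifying assumptions (rows are plain dictionaries, the "warehouse" is a Python list, and the field names `id` and `amount` are invented); real ETL tools operate on databases, not lists.

```python
def extract(source_rows):
    """Extraction (capture): obtain the relevant rows from a source."""
    return [r for r in source_rows if r.get("relevant", True)]

def cleanse(rows):
    """Cleansing: drop duplicates and rows with a missing key."""
    seen, clean = set(), []
    for r in rows:
        key = r.get("id")
        if key is not None and key not in seen:
            seen.add(key)
            clean.append(r)
    return clean

def transform(rows):
    """Transformation: map source fields onto the common warehouse schema."""
    return [{"id": r["id"], "amount": float(r["amount"])} for r in rows]

def load(warehouse, rows):
    """Loading: append the transformed rows to the warehouse table."""
    warehouse.extend(rows)
    return warehouse

warehouse = []
source = [{"id": 1, "amount": "10.5"},
          {"id": 1, "amount": "10.5"},   # duplicate, removed by cleansing
          {"id": 2, "amount": "3"}]
load(warehouse, transform(cleanse(extract(source))))
```

Note how the phases compose in order: extract, then cleanse, then transform, then load.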
Extraction
Relevant data is obtained from the sources in the extraction phase. Static extraction is used when a data warehouse needs populating for the first time; incremental extraction is used to update data warehouses regularly. The data to be extracted is mainly selected on the basis of its quality.
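The difference between the two extraction modes can be shown with a small sketch. The `updated` timestamp field is an assumption for illustration; incremental extraction here simply takes rows changed since the last load (ISO date strings compare correctly as plain strings).

```python
def static_extract(source):
    """Static extraction: take everything, for the first population."""
    return list(source)

def incremental_extract(source, last_loaded):
    """Incremental extraction: take only rows changed since the last load."""
    return [r for r in source if r["updated"] > last_loaded]

source = [
    {"id": 1, "updated": "2024-01-01"},
    {"id": 2, "updated": "2024-02-01"},
]
first_load = static_extract(source)                      # full capture
delta = incremental_extract(source, last_loaded="2024-01-15")  # changes only
```

In practice the `last_loaded` watermark would itself be stored in warehouse metadata between runs.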
Cleansing
The cleansing phase is crucial in a data warehouse system because it is supposed to improve data quality. A few of the mistakes and inconsistencies that make data dirty are:
Duplicate data
Inconsistent values that are logically associated
Missing data, such as a customer's job
Impossible or wrong values
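A cleansing step that flags the kinds of dirty data listed above might look like the following sketch. The record fields (`id`, `job`, `age`) and the age bounds are hypothetical, chosen only to illustrate duplicate detection, missing-value checks, and impossible-value checks.

```python
def find_problems(rows):
    """Flag duplicates, missing values, and impossible values in customer rows."""
    problems = []
    seen = set()
    for r in rows:
        if r["id"] in seen:                       # duplicate data
            problems.append((r["id"], "duplicate"))
        seen.add(r["id"])
        if r.get("job") is None:                  # missing data
            problems.append((r["id"], "missing job"))
        if not (0 <= r.get("age", 0) <= 130):     # impossible value
            problems.append((r["id"], "impossible age"))
    return problems

rows = [
    {"id": 1, "job": "clerk", "age": 34},
    {"id": 1, "job": "clerk", "age": 34},   # duplicate
    {"id": 2, "job": None, "age": 200},     # missing job, impossible age
]
issues = find_problems(rows)
```

A real cleansing tool would also repair or standardize the flagged values, not merely report them.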
Loading
Loading into a data warehouse is the last step. Loading can be carried out in two ways:
Refresh: Data warehouse data is completely rewritten, which means that older data is replaced.
Update: Only the changes applied to the source data are added to the data warehouse. An update is typically carried out without deleting or modifying pre-existing data. This technique is used in combination with incremental extraction to update data warehouses regularly.
3) Data warehouse layer :
Data warehouses have emerged to meet the organization's decision-support needs. Surrounded by analytical tools and models, data warehouses have the potential to transform operational data into business intelligence, enabling effective problem and opportunity identification, critical decision making, and strategy formulation, implementation, and evaluation.
Content Management :
Managing the content of a data warehouse is a daunting task. Operational systems draw data from a variety of databases that run on different hardware platforms, use different operating systems and DBMSs, and have different database structures with varying structural, conceptual, and instance-level semantics.
Major challenges remain for data warehouse content management:
Identifying and accessing the appropriate data sources, and coordinating data capture from them in an appropriate timeframe.
A data warehouse serves as a repository for data extracted from diverse operational information systems.
The extraction, transformation, and loading (ETL) functions in a data warehouse are considered the most time-consuming and expensive portion of the development lifecycle.
Often such operational systems were not designed to be integrated, and data extracts are performed manually or on a schedule determined by the operational systems.
As a result, data in the data warehouse may reflect different states of different systems; data extracted from an inventory system, for example, may not be synchronized with data extracted from other systems.
Coordination mechanisms must therefore be established.
Clearly, the data warehouse must go beyond its current role as a repository of historical data describing the operations and transactions in which the organization has engaged. It must include data describing partners and partnerships, policies and rules of the business, competitors and markets, goals and standards, opportunities and problems, and alternatives and predicted futures.
Support
Organizations are using data warehousing to support strategic and mission-critical applications. Data deposited into the data warehouse must be transformed into information and knowledge and appropriately disseminated to decision makers within the organization and to critical partners in various supply chains. Problems that need to be addressed in this area are:
1) Selection of proper analytical and data mining tools
2) Privacy and security of data
3) System performance
4) An adequate level of training and support
A data warehouse is an integrated, subject-oriented collection of strategic information that serves as a single source for the decision support environment.
Data warehouse model: an abstract model, supported by graphical and lexical documentation, representing the data warehouse content that is involved in analytics applications.
Difference between Data warehousing and OLTP model
Global
A global warehouse is designed and created based on the holistic needs of the enterprise. It can act as a common repository for decision support data across the entire enterprise.
The term global in this warehouse architecture does not refer only to a centralized scheme (or a physical location); it reflects the scope and access of data across the organization. The data warehouse could also be distributed across different physical locations.
The major issues in setting up this kind of data warehouse are the time and cost involved when it spans multiple geographic locations.
Top-Down Implementation
A top-down implementation requires more planning and design work to be completed
at the beginning of the project. This brings with it the need to involve people from
each of the workgroups, departments, or lines of business that will be participating in
the data warehouse implementation. Decisions concerning data sources to be used,
security, data structure, data quality, data standards, and an overall data model will
typically need to be completed before actual implementation begins. However, the
cost of the initial planning and design can be significant. It is a time-consuming
process and can delay actual implementation, benefits, and return-on-investment.
Bottom-Up Implementation
A bottom-up implementation involves the planning and designing of data marts
without waiting for a more global infrastructure to be put in place. This does not
mean that a more global infrastructure will not be developed; it will be built
incrementally as initial data mart implementations expand. This approach is more
widely accepted today than the top down approach because immediate results from
the data marts can be realized and used as justification for expanding to a more
global implementation. The bottom-up implementation approach has become the choice of many organizations, especially among business management, because of the faster payback.
Considerations While Choosing a Data Warehouse Modelling Approach
1) Fact : Facts contain:
Dimension keys (each dimension key is a reference to a dimension)
Grain
Measures and supportive measures
2) Dimension : A dimension is a collection of members or units of the same type. A dimension provides a certain business context to each measure. Common dimensions could be:
Time
Location/region
Customers
Salesperson
Grain of a dimension : The grain of a dimension is the lowest level of detail available within that dimension.
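The relationship between facts, dimension keys, measures, and grain can be sketched with toy records. All table and field names here (`time_dim`, `store_dim`, `fact_sales`, etc.) are hypothetical; the point is that a fact row holds only keys and measures, while the dimensions supply the business context.

```python
# Dimension tables: keyed rows that give measures their business context.
time_dim = {101: {"date": "2024-03-01", "month": "2024-03", "year": 2024}}
store_dim = {7: {"city": "Pune", "region": "West"}}

# Fact table at the grain "one row per store per day":
# dimension keys plus measures.
fact_sales = [
    {"time_key": 101, "store_key": 7, "units": 12, "revenue": 480.0},
]

# Following a fact's dimension keys recovers the business context.
row = fact_sales[0]
context = (time_dim[row["time_key"]]["month"],
           store_dim[row["store_key"]]["region"])
```

Choosing a finer grain (e.g. per store, per day, per product) multiplies the number of fact rows but enables more detailed analysis.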
Drill-down
Exploring facts at more detailed levels
Roll-up
Aggregating facts at less detailed levels
Slice and dice are the operations for browsing the data through the visualized cube. Slicing cuts through the cube by fixing a value on one dimension so that users can focus on a specific perspective. Dicing selects a subcube by fixing values on two or more dimensions so that users can be more specific in their data analysis.
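The cube operations above can be illustrated over plain fact rows. The data and attribute names are invented; roll-up is shown as aggregation over one attribute, and slicing as fixing a dimension value (drill-down would simply return to the more detailed rows).

```python
from collections import defaultdict

# Fact rows at the daily grain (hypothetical data).
facts = [
    {"date": "2024-03-01", "region": "West", "units": 10},
    {"date": "2024-03-02", "region": "West", "units": 5},
    {"date": "2024-03-01", "region": "East", "units": 7},
]

def roll_up(rows, by):
    """Roll-up: aggregate the measure at a less detailed level."""
    totals = defaultdict(int)
    for r in rows:
        totals[r[by]] += r["units"]
    return dict(totals)

def slice_cube(rows, **fixed):
    """Slice: fix dimension values to focus on one perspective."""
    return [r for r in rows if all(r[k] == v for k, v in fixed.items())]

units_by_region = roll_up(facts, by="region")   # less detail: per region
west_only = slice_cube(facts, region="West")    # focus on one region
```

Dicing would pass several fixed dimensions to `slice_cube` at once, yielding a subcube rather than a single slice.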
Requirement Analysis
Requirement analysis is used to build an initial dimensional model that represents the end user requirements which were previously captured in an informal way. The output of this phase acts as an input for the requirement modeling activities, once it has passed the requirements validation phase. The deliverables of this phase consist of a combination of:
Initial dimensional data models
A business directory, or metadata definitions of all elements of the multi-dimensional model
The end user requirements can be classified into two major categories:
Process-oriented requirements: These represent the major information processing elements which the end users are performing. Process-oriented requirements may be Business Objectives or Business Queries.
Information-oriented requirements: These represent the major data items which
the end users require for their data analysis activities.
The ultimate scope of requirement analysis can be summarized as:
Gather and interpret business requirements and formulate a business question.
Candidate measures, facts, and dimensions are determined.
Grains of dimensions and granularities of measures and facts are determined.
Query-oriented approach
In this approach, the dimensions are determined first; then the facts are established. This follows the natural query-oriented approach of picking the end user queries as the first source of information.
Business-oriented approach
This approach tries to capture the fundamental elements of the business problem. First, the facts are determined through analysis of the problem domain from the business point of view. Then, the dimensions and measures are added to the model.
Data source oriented approach
This approach focuses on the source database models to determine the dimensions, followed by the measures and facts.
Requirements modelling
After the requirements have been validated, they can be represented as a model. The model can be an initial multi-dimensional model, or a concrete model represented using cubes or a mathematical notation technique representing points in a multi-dimensional space. These representations may be appealing, especially cubes, but their complexity increases exponentially as the dimensionality increases. For simplicity, we'll keep the model as a cubical dimensional model.
The requirements modelling activities can be distinguished into two broad groups:
(i) Base techniques - used for producing the logical models for the dimensions in the initial model. These dimension modeling techniques involve:
Adding dimension attributes, which aid in selecting the relevant facts
Dimension browsing - exploring the dimension to detect and set the appropriate selection and aggregation constraints used in subsequent analysis of facts
Once the dimension attributes and facts are gathered, a detailed dimension model is prepared.
(ii) Detailed dimension modeling - which should incorporate the structure of the dimension as well as all of its attributes.
The proposed approach for modeling the dimensions consists of the following
activities for each dimension hierarchy:
Create an entity for each of the aggregation levels within the hierarchy and add
identifiers for each of the dimension entities.
Link the entities in a hierarchical structure and add the required attributes to each dimension entity (those useful, relevant, or requested by the end user).
Demote aggregation levels which do not have any associated attributes from dimension entities into dimension attributes.
This kind of approach leads to the so-called snowflake models, because it standardizes the dimension hierarchies and aggregation levels.
There are two basic models that can be used in dimensional modeling:
Star model
Snowflake model
Star Schema
Star schema has become a common term used to connote a dimensional model.
Database designers have long used the term star schema to describe dimensional
models because the resulting structure looks like a star and the logical diagram looks
like the physical schema. Each dimension table is a denormalized construct which holds all the attributes of all the aggregation levels in all of the hierarchies of a given dimension.
Very pragmatic approach
Easier to use
Snowflake Schema
This is a representation of a multidimensional data model in which the dimension hierarchies are structured and normalized. Because the data is normalized, redundancy is minimal compared to the star schema. This model is highly useful in situations where the dimensions are very complex, and it offers increased modeling and design flexibility.
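The structural difference between the two schemas can be sketched with toy tables. All names (`star_product_dim`, `snow_category`, the `Pen`/`Stationery`/`Office` hierarchy) are hypothetical: the star version stores every aggregation level in one denormalized row, while the snowflake version normalizes the hierarchy into linked tables that must be joined at query time.

```python
# Star: one denormalized dimension table holds all aggregation levels.
star_product_dim = [
    {"product_key": 1, "product": "Pen",
     "category": "Stationery", "dept": "Office"},
]

# Snowflake: the hierarchy is normalized into linked tables (less redundancy).
snow_product = [{"product_key": 1, "product": "Pen", "category_key": 10}]
snow_category = [{"category_key": 10, "category": "Stationery", "dept_key": 100}]
snow_dept = [{"dept_key": 100, "dept": "Office"}]

def snow_lookup_dept(product_key):
    """Resolve a product's department by joining along the snowflaked hierarchy."""
    p = next(r for r in snow_product if r["product_key"] == product_key)
    c = next(r for r in snow_category if r["category_key"] == p["category_key"])
    d = next(r for r in snow_dept if r["dept_key"] == c["dept_key"])
    return d["dept"]
```

The star form answers the same question with a single row access, which is why it is the more pragmatic and easier-to-use choice; the snowflake form trades extra joins for reduced redundancy and flexibility.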
Hints and Tips While Making a Multi-Dimensional Model
Properties of measures
Business-related facts
Fact identifiers, dimension keys and uniqueness
Dimension roles