BI Unit-II DWH
Introduction
Where is it used?
It is used for evaluating future strategy.
It needs a successful technician who is:
Flexible.
A team player.
Well balanced in business and technical understanding.
Introduction-Cont’d.
The ultimate use of a data warehouse is mass customization.
For example, it increased Capital One’s customer base from
1 million to approximately 9 million in 8 years.
Just like a muscle: DW increases in strength with active use.
With each new test and product, valuable information is
added to the DW, allowing the analyst to learn from the
success and failure of the past.
The key to survival is the ability to analyze, plan, and react to
changing business conditions much more rapidly.
Data Warehouse
For its data to be effective, a DW must be:
Consistent.
Well integrated.
Well defined.
Time stamped.
DW environment:
The data store, data mart & the metadata.
Bill Inmon provided the following
definition:
“A data warehouse is a subject-oriented,
integrated, time-variant and non-volatile
collection of data in support of
management’s decision making process.”
Subject-Oriented:
A data warehouse can be used to analyze a
particular subject area. For example, “sales” can
be a particular subject.
Integrated:
A data warehouse integrates data from multiple
data sources.
For example, source A and source B may have different ways of
identifying a product, but in a data warehouse, there will be only a
single way of identifying a product.
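The integration step can be illustrated with a minimal sketch. The source layouts, field names, and the cross-reference table below are hypothetical and stand in for whatever the real source systems provide; the point is only that both sources end up using one warehouse product key.

```python
# Hypothetical source extracts: each system identifies the same product differently.
SOURCE_A = [{"sku": "A-1001", "name": "Blue Widget", "qty": 3}]
SOURCE_B = [{"item_no": 77, "title": "blue widget", "units": 5}]

# Hypothetical cross-reference built during integration:
# (source system, source identifier) -> single warehouse product key.
PRODUCT_XREF = {
    ("A", "A-1001"): 1,
    ("B", "77"): 1,
}


def integrate(source_a, source_b, xref):
    """Return rows that all use the single warehouse product key."""
    rows = []
    for rec in source_a:
        rows.append({"product_key": xref[("A", rec["sku"])], "quantity": rec["qty"]})
    for rec in source_b:
        rows.append({"product_key": xref[("B", str(rec["item_no"]))], "quantity": rec["units"]})
    return rows


integrated = integrate(SOURCE_A, SOURCE_B, PRODUCT_XREF)

# Because both sources now share one key, quantities can be combined per product.
total_by_product = {}
for row in integrated:
    key = row["product_key"]
    total_by_product[key] = total_by_product.get(key, 0) + row["quantity"]
```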
Time-Variant:
Historical data is kept in a data warehouse.
For example, one can retrieve data from 3 months,
6 months, 12 months, or even older data from a data
warehouse.
This contrasts with a transaction system, where
often only the most recent data is kept.
For example, a transaction system may hold only the
most recent address of a customer, whereas a data
warehouse can hold all addresses associated with a
customer.
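The address example can be sketched as follows. The class and method names are illustrative assumptions, not part of any particular product: the transaction system overwrites the address in place, while the warehouse appends a time-stamped row and can answer "as of" questions.

```python
from datetime import date


class TransactionSystem:
    """Keeps only the most recent address per customer (overwrites in place)."""

    def __init__(self):
        self.address = {}

    def update_address(self, customer_id, address):
        self.address[customer_id] = address


class WarehouseAddressHistory:
    """Keeps every address ever loaded, each stamped with its effective date."""

    def __init__(self):
        self.rows = []  # append-only: (customer_id, address, effective_date)

    def load(self, customer_id, address, effective_date):
        self.rows.append((customer_id, address, effective_date))

    def address_as_of(self, customer_id, as_of):
        """Return the address in effect on the given date, or None if none was known."""
        candidates = [
            (eff, addr)
            for cid, addr, eff in self.rows
            if cid == customer_id and eff <= as_of
        ]
        return max(candidates)[1] if candidates else None


oltp = TransactionSystem()
dw = WarehouseAddressHistory()
for addr, eff in [("12 Oak St", date(2022, 1, 10)), ("9 Elm Ave", date(2023, 6, 1))]:
    oltp.update_address(42, addr)  # old address is lost here
    dw.load(42, addr, eff)         # old address is kept here
```

Existing rows are never modified in this sketch; history grows only by appending.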
Non-volatile:
Once data is in the data warehouse, it will not
change. So, historical data in a data warehouse
should never be altered.
Ralph Kimball provided a more concise
definition of a data warehouse:
“A data warehouse is a copy of transaction data
specifically structured for query and analysis.”
i. Subject-oriented:
The warehouse organizes data around the essential subjects of
the business (customers and products) rather than around
applications such as inventory management or order
processing.
ii. Integrated:
It is consistent in the way that data from several sources is
extracted and transformed. For example, coding conventions
are standardized: M = male, F = female.
iii. Time-variant:
Data are organized by various time periods (e.g. months).
iv. Non-volatile:
The warehouse’s database is not updated in real time. There is
periodic bulk uploading of transactional and other data, which
makes the data less subject to momentary change.
There are a number of steps and processes in building a warehouse.
What Is a Data Warehouse Used For?
Here are the most common sectors where data warehouses are used:
Airline:
In the airline industry, it is used for operational purposes such as crew
assignment, analysis of route profitability, frequent flyer
program promotions, etc.
Banking:
It is widely used in the banking sector to manage available
resources effectively. Some banks also use it for market
research and for performance analysis of products and
operations.
Healthcare:
The healthcare sector also uses data warehouses to strategize and predict outcomes, generate patients'
treatment reports, and share data with tie-in insurance companies, medical aid services, etc.
Public sector:
In the public sector, data warehouses are used for intelligence gathering. They help government
agencies maintain and analyze tax records and health policy records for every individual.
Investment and Insurance sector:
In this sector, the warehouses are primarily used to analyze data patterns, customer trends, and to
track market movements.
Retail chain:
In retail chains, data warehouses are widely used for distribution and marketing. They also help to
track items, customer buying patterns, and promotions, and are used for determining pricing policy.
Telecommunication:
A data warehouse is used in this sector for product promotions, sales decisions, and
distribution decisions.
Hospitality Industry:
This industry uses warehouse services to design and evaluate its advertising and
promotion campaigns, targeting clients based on their feedback and travel
patterns.
Data Warehouse Tools
There are many data warehousing tools available in the market. Here are some of the most
prominent ones:
1. MarkLogic:
MarkLogic is a data warehousing solution that makes data integration easier and faster using
an array of enterprise features. This tool helps to perform very complex search operations. It can
query different types of data like documents, relationships, and metadata.
http://developer.marklogic.com/products
2. Oracle:
Oracle is a widely used commercial database. It offers a wide range of data warehouse solutions,
both on-premises and in the cloud. It helps to optimize customer experiences by increasing
operational efficiency.
https://www.oracle.com/index.html
3. Amazon RedShift:
Amazon Redshift is a data warehouse tool. It is a simple and cost-effective tool for analyzing all types of
data using standard SQL and existing BI tools. It also allows running complex queries against
petabytes of structured data, using query optimization techniques.
https://aws.amazon.com/redshift/?nc2=h_m1
Benefits of a Data Warehouse:
The architectural complexity of a data warehouse provides the opportunity to:
a. Maintain data history, even if the source transaction systems do not.
b. Integrate data from multiple source systems, enabling a central view
across the enterprise. This benefit is always valuable, but particularly so
when the organization has grown by merger.
c. Improve data, by providing consistent codes and descriptions,
flagging or even fixing bad data.
d. Present the organization’s information consistently.
e. Provide a single common data model for all data of interest
regardless of the data’s source.
f. Restructure the data so that it makes sense to the business users.
g. Restructure the data so that it delivers excellent query performance,
even for complex analytic queries, without impacting the operational
systems.
h. Add value to operational business applications, notably customer
relationship management (CRM) systems.
Dimensions of Data Warehouse:
A dimension is a data element that categorizes each item in a data set into
non-overlapping regions. A data warehouse dimension provides the means
to “slice and dice” data in a data warehouse. Dimensions provide structured
labeling information to otherwise unordered numeric measures. For
example, “Customer”, “Date”, and “Product” are all dimensions that could
be applied meaningfully to a sales receipt. A dimensional data element is
similar to a categorical variable in statistics.
The primary function of dimensions is threefold: to provide filtering,
grouping and labeling. For example, in a data warehouse where each person
is categorized as having a gender of male, female or unknown, a user of the
data warehouse would then be able to filter or categorize each presentation
or report by either filtering based on the gender dimension or displaying
results broken out by the gender.
Each dimension in a data warehouse may have one or more hierarchies
applied to it. For the “Date” dimension, there are several possible
hierarchies: “Day > Month > Year”, “Day > Week > Year”, “Day > Month >
Quarter > Year”, etc.
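The filtering, grouping and labeling roles of dimensions, and the use of a date hierarchy, can be sketched in a few lines. The sample sales rows and the helper names (date_level, total_by) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sales facts: each numeric measure (amount) carries dimension values.
SALES = [
    {"date": date(2024, 1, 15), "product": "Widget", "gender": "female", "amount": 120.0},
    {"date": date(2024, 2, 3), "product": "Widget", "gender": "male", "amount": 80.0},
    {"date": date(2024, 5, 20), "product": "Gadget", "gender": "female", "amount": 200.0},
    {"date": date(2025, 1, 7), "product": "Gadget", "gender": "unknown", "amount": 50.0},
]


def date_level(d, level):
    """Label a day at one level of the Day > Month > Quarter > Year hierarchy."""
    if level == "year":
        return str(d.year)
    if level == "quarter":
        return f"{d.year}-Q{(d.month - 1) // 3 + 1}"
    if level == "month":
        return f"{d.year}-{d.month:02d}"
    return d.isoformat()  # "day"


def total_by(rows, key_fn, keep=lambda row: True):
    """Filter rows with `keep`, group them by `key_fn`, and sum the amount measure."""
    totals = defaultdict(float)
    for row in rows:
        if keep(row):
            totals[key_fn(row)] += row["amount"]
    return dict(totals)


# Grouping and labeling: roll the Date dimension up to the quarter level.
by_quarter = total_by(SALES, lambda r: date_level(r["date"], "quarter"))

# Filtering plus grouping: 2024 sales only, broken out by gender.
by_gender_2024 = total_by(
    SALES, lambda r: r["gender"], keep=lambda r: r["date"].year == 2024
)
```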
Business Intelligence
Business Intelligence is a term commonly associated with
data warehousing.
In fact, many of the tool vendors position their products
as business intelligence software rather than data
warehousing software.
There are other occasions where the two terms are used
interchangeably.
Business intelligence usually refers to the information that is
available for the enterprise to make decisions on.
A data warehousing (or data mart) system is the backend, or the
infrastructural, component for achieving business intelligence.
Business intelligence also includes the insight gained from doing
data mining analysis, as well as unstructured data (thus the
need for content management systems).
DATA WAREHOUSING & DATA MINING
Components of Data Warehouse:
The data warehouse is based on an RDBMS server, a central
information repository surrounded by key components that
make the entire environment functional, manageable and accessible.
There are five main components of a data warehouse:
1. Data Warehouse Database
2. Sourcing, Acquisition, Clean-up and Transformation Tools (ETL)
3. Metadata
4. Query Tools
5. Data Marts
Three parts of the data warehouse
The data warehouse itself, which contains the data and associated software
Data acquisition (back-end) software, which extracts data from legacy systems and external
sources, consolidates and summarizes them, and loads them into the data warehouse
Client (front-end) software, which allows users to access and analyze data from the warehouse
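A minimal sketch of how the three parts fit together, assuming an in-memory SQLite database as a stand-in for the warehouse. The source rows, the table name monthly_sales, and the function names are hypothetical.

```python
import sqlite3

# Hypothetical extracts from a legacy system and an external source,
# already in a common (order_date, region, amount) layout.
LEGACY_ORDERS = [("2024-01-05", "north", 100.0), ("2024-01-09", "south", 40.0)]
EXTERNAL_ORDERS = [("2024-01-20", "north", 60.0), ("2024-02-02", "south", 25.0)]


def consolidate(*sources):
    """Back end: merge rows from all sources into one list."""
    return [row for source in sources for row in source]


def summarize(rows):
    """Back end: total the order amounts per (month, region)."""
    totals = {}
    for order_date, region, amount in rows:
        key = (order_date[:7], region)  # "YYYY-MM"
        totals[key] = totals.get(key, 0.0) + amount
    return [(month, region, total) for (month, region), total in sorted(totals.items())]


def load(conn, summary_rows):
    """Back end: bulk-load the summary into a warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS monthly_sales (month TEXT, region TEXT, total REAL)"
    )
    conn.executemany("INSERT INTO monthly_sales VALUES (?, ?, ?)", summary_rows)
    conn.commit()


# The warehouse itself (here just an in-memory database).
conn = sqlite3.connect(":memory:")
load(conn, summarize(consolidate(LEGACY_ORDERS, EXTERNAL_ORDERS)))

# Front end: a client query against the warehouse.
north_jan = conn.execute(
    "SELECT total FROM monthly_sales WHERE month = ? AND region = ?",
    ("2024-01", "north"),
).fetchone()[0]
```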
Data Warehouse Architectures
Two-tier architecture
The two-tier architecture physically separates the
available sources from the data warehouse.
This architecture is not expandable and does not
support a large number of end users.
It also has connectivity problems because of
network limitations.
Figure: Architecture of a two-tier data warehouse
3-Tier Data Warehouse Architecture
Figure: Architecture of a three-tier data warehouse
Data warehouses typically adopt a three-tier architecture.
Data Extraction
Data Cleaning
Data Transformation
Load
Refresh
Bottom Tier Contains:
Data warehouse
Metadata Repository
Data Marts
Monitoring and
Administration
Metadata repository:
Data marts:
Dependent: sourced directly from the data warehouse
Independent: sourced from one or more data sources
Monitoring & Administration:
Data Refreshment
Data source synchronization
Disaster recovery
Managing access control and security
Manage data growth, database performance
Controlling the number & range of queries
Limiting the size of data warehouse
Bottom Tier: Data Warehouse Server
Figure: bottom tier, showing data sources A, B, and C, the data warehouse, the metadata
repository, data marts, and monitoring and administration
Middle Tier: OLAP Server
It presents users with a multidimensional view of data from the
data warehouse or data marts.
It is typically implemented using the ROLAP or MOLAP model; hybrid and specialized variants also exist:
Relational OLAP
ROLAP servers are placed between the relational back-end server
and client front-end tools.
To store and manage warehouse data, ROLAP uses a
relational or extended-relational DBMS.
ROLAP includes the following:
Implementation of aggregation navigation logic.
Optimization for each DBMS back end.
Additional tools and services
Multidimensional OLAP
MOLAP uses array-based multidimensional storage engines for multidimensional views of
data.
With multidimensional data stores, the storage utilization may be low if the data set is sparse.
Many MOLAP servers therefore use two levels of data storage representation to handle
dense and sparse data sets.
Hybrid OLAP (HOLAP)
Hybrid OLAP is a combination of both ROLAP and MOLAP.
It offers the higher scalability of ROLAP and the faster computation of MOLAP.
HOLAP servers allow large volumes of detailed data to be stored,
while the aggregations are stored separately in a MOLAP store.
Specialized SQL Servers
Specialized SQL servers provide advanced query language and query processing support
for SQL queries over star and snowflake schemas in a read-only environment.
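As a rough illustration of a relational (ROLAP-style) query over a star schema, the sketch below uses SQLite as a stand-in for the relational back end. The table and column names (fact_sales, dim_product, dim_date) and the sample rows are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A tiny star schema: one fact table referencing two dimension tables.
conn.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, month TEXT, year INTEGER);
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    date_key INTEGER REFERENCES dim_date(date_key),
    amount REAL
);
INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware'), (2, 'Manual', 'Books');
INSERT INTO dim_date VALUES (20240115, '2024-01', 2024), (20240210, '2024-02', 2024);
INSERT INTO fact_sales VALUES (1, 20240115, 100.0), (1, 20240210, 50.0), (2, 20240210, 20.0);
""")

# A multidimensional view (category by month) computed with joins and GROUP BY.
rows = conn.execute("""
SELECT p.category, d.month, SUM(f.amount)
FROM fact_sales AS f
JOIN dim_product AS p ON p.product_key = f.product_key
JOIN dim_date AS d ON d.date_key = f.date_key
GROUP BY p.category, d.month
ORDER BY p.category, d.month
""").fetchall()
```

A MOLAP server would instead keep such aggregates in an array-based store rather than computing them from relational tables at query time.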
Top Tier: Front-end tools
The top tier is a front-end client layer, which contains query and reporting
tools, analysis tools, and/or data mining tools.
Alternative Data Warehouse
Architectures:
Distributed Data Warehouse Architecture
Federated: a concession to the natural forces that undermine the best plans for
developing a perfect system. It uses all possible means to integrate analytical
resources from multiple sources to meet changing needs or business conditions.
Essentially, the federated approach involves integrating disparate systems. It is
good for supplementing data warehouses but not for replacing them.
Figure: Alternative architectures for data warehouse efforts
Figure: Teradata Corp.’s EDW
The alternative data warehouse architectures and their basic descriptions:
Independent data marts: arguably the simplest and least costly
architecture alternative. Data marts are developed to operate
independently of each other to serve the needs of individual
organizational units.
Data mart bus: individual marts are linked to each other via some kind
of middleware.
Hub-and-spoke: attention is focused on building a scalable and
maintainable infrastructure, which allows for easy customization of user
interfaces and reports.
Centralized enterprise data warehouse: similar to hub-and-spoke,
but with no dependent data marts; instead, a single very large enterprise
DW serves the needs of all organizational units. It gives users
access to all data in the DW instead of limiting them to just the data marts.
Ten factors that potentially affect the architecture selection decision:
The ETL process seems quite straightforward. As with every application, however, the ETL
process can fail. Failures can be caused by missing extracts from one of
the systems, missing values in one of the reference tables, or simply a connection or
power outage. Therefore, it is necessary to design the ETL process with failure recovery
in mind.
Staging
It should be possible to restart, at least, some of the phases independently from the
others and this can be ensured by implementing proper staging.
Staging means that the data is simply dumped to the location (called the Staging Area) so
that it can then be read by the next processing phase.
The staging area is also used during the ETL process to store intermediate results of
processing. However, the staging area should be accessed by the ETL load process
only.
It should never be available to anyone else, particularly not to end users: it is not
intended for data presentation, because it may contain incomplete or partially
processed data.
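The idea of restartable phases backed by a staging area can be sketched as follows. This is only an illustration under stated assumptions: the staging area is a local directory of JSON files, and the helper run_phase and the phase functions are hypothetical names, not part of any ETL tool.

```python
import json
import tempfile
from pathlib import Path


def run_phase(staging_dir, name, func, upstream=None):
    """Run one ETL phase unless its staged output already exists.

    Each phase dumps its result to <staging_dir>/<name>.json, so a restarted
    run can skip phases that already completed. Returns (data, ran_now).
    """
    out_path = Path(staging_dir) / f"{name}.json"
    if out_path.exists():
        return json.loads(out_path.read_text()), False  # reuse staged data
    data = func(upstream) if upstream is not None else func()
    tmp_path = out_path.with_suffix(".tmp")
    tmp_path.write_text(json.dumps(data))
    tmp_path.replace(out_path)  # publish only once the output is complete
    return data, True


def extract():
    """Hypothetical extract from a source system."""
    return [{"id": 1, "gender": "M"}, {"id": 2, "gender": "F"}]


def transform(rows):
    """Standardize coding conventions (M = male, F = female)."""
    codes = {"M": "male", "F": "female"}
    return [{"id": r["id"], "gender": codes[r["gender"]]} for r in rows]


staging = tempfile.mkdtemp()  # stand-in for the staging area
extracted, ran_extract = run_phase(staging, "extract", extract)
transformed, ran_transform = run_phase(staging, "transform", transform, extracted)

# Simulated restart: the extract phase is skipped because its output is staged.
_, reran_extract = run_phase(staging, "extract", extract)
```

Writing to a temporary file first and renaming it afterwards keeps a half-written file from being mistaken for completed staged output.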
ETL Tool Implementation
When you are about to use an ETL tool, there is a fundamental decision to
be made:
Will the company build its own data transformation tool or will it use an
existing tool?
Building your own data transformation tool (usually a set of shell scripts) is
the preferred approach for a small number of data sources that reside in
storage of the same type. The reason is that the effort needed to implement the
necessary transformations is small, thanks to the similar data structures and common
system architecture.
There are many ready-to-use ETL tools on the market. The main benefit of
using off-the-shelf ETL tools is the fact that they are optimized for the ETL
process by providing connectors to common data sources like databases, flat
files, mainframe systems, xml, etc. They provide a means to implement data
transformations easily and consistently across various data sources.
The tools also support transformation scheduling, version control,
monitoring and unified metadata management. Some of the ETL tools are
even integrated with BI tools.
The most well-known commercial tools are
Ab Initio,
IBM InfoSphere DataStage,
Informatica,
Oracle Data Integrator and
SAP Data Integrator.
There are several open source ETL tools, among others
Apatar, CloverETL, Pentaho and Talend.
Metadata: this is where information about the data stored in the data warehouse
system is kept.