Datawarehouse Intro Slides
Datawarehouse Intro Slides
Kimball
1
What is a data warehouse?
Data warehouses began in the 70’s out of the need of many
companies to combine the data of it’s various operational
systems into a useful and consistent form for analysis.
3
Basic Elements of Data Warehouse
4
Core elements of the Kimball
DW/BI architecture
5
High level technical architecture
for data warehouse
6
Key Definitions
• Data mart is a specific, subject-oriented
repository of data that was designed to answer
specific questions
– Usually, multiple data marts exist to serve the needs
of multiple business units (sales, marketing,
operations, collections, accounting, etc.)
• Data warehouse is a single organizational
repository of enterprise wide data across many
or all subject areas.
– Data warehouse is an enterprise wide collection of
data marts
7
Key Definitions
• “Business Intelligence” refers to reporting
and analysis of data stored in the
warehouse
• Data warehouse is the foundation for
business intelligence.
• ‘‘Data warehouse/business intelligence’’
(DW/BI) refers to the complete end-to-end
system.
8
Two Main Data Warehouse
Development Methodologies
• Top-down approach
– The Inmon’s approach
– DW is developed based on the Enterprise wide data model
– DW as a single repository feeds data into data marts
– Longer to implement
• May fail due to the lack of patience and commitment
• Bottom-up approach
– The Kimball’s approach
– Starts with one data mart (ex. sales); later on additional data marts
are added (ex. collection, marketing, etc.)
– Data flows from source into data marts, then into the data warehouse
– Faster to implement
• Implementation in stages
– Need to ensure consistency of metadata
• Making sure each data mart calls Apple and Apple
• The Hybrid approach 9
The Kimball Lifecycle Diagram
10
The Kimball Lifecycle
• Illustrates the general flow of a DW
implementation
• Identifies task sequencing and highlights
activities that should happen concurrently
• May need to be customized to address the
unique needs of your organization
• Not every detail of every Lifecycle task will
be performed on every project
11
The Kimball Lifecycle,
SDLC, and DBLC
DB Design
Analysis
Implementation
Detailed
System
Design Testing
Implementatio
n Operation
Maintenance Maintenance
12
Program/Project Planning
• Kimball’s view of programs and projects
– Project refers to a single iteration of the Kimball
Lifecycle
• from launch through deployment
– Program refers to the broader, ongoing coordination
of resources, infrastructure, timelines, and
communication across multiple projects
• a program contains multiple projects
– In real world, programs do not necessarily start before
projects although ideally they should be.
13
Program/Project Planning
• Project planning
– Scope definition ⬄understanding business
requirements
– Tasks’ identification
– Scheduling
– Resource planning
– Workload assignment
– The end document represents a blueprint of
the project
14
Program/Project Management
• Enforces the project plan
• Activities:
– Status monitoring
– Issue tracking
– Development of a comprehensive
communication plan that addresses both the
business and IT units
15
Business Requirements Definition
• Success of the project depends on a solid
understanding of the business
requirements!!!
• Understanding the key factors driving the
business is crucial for successful
translation of the business requirements
into design considerations
16
What follows the business
requirements definition?
• 3 concurrent tracks focusing on
– Technology
– Data
– Business intelligence applications
– Arrows in the diagram indicate the activity
workflow along each of the parallel tracks
– Dependencies between the tasks are
illustrated by the vertical alignment of the task
boxes.
17
Technology Track
• Technical Architecture Design
– Overall architectural framework and vision
– Considerations:
• the business requirements
• current technical environment
• planned strategic technical directions
18
Technology Track
• Product Selection and Installation
– Based on the designed technical architecture
• Evaluation and selection of
– Products that will deliver needed capabilities
– Hardware platform
– Database management system
– Extract-transformation-load (ETL) tools
– Data access query tools
– Reporting tools must be evaluated
• Installation of selected products/components/tools
• Testing of installed products to ensure appropriate
end-to-end integration within the data warehouse
environment.
19
Data Track
20
Dimensional Modeling
• Detailed data analysis of a single business
process is performed to identify the fact table
granularity, associated dimensions and
attributes, and numeric facts.
• Dimensional models contain the same data
content and relationships as models normalized
into third normal form, but structured differently.
– Improve understandability and query performance
required by DW/BI
• Primary constructs of a dimensional model
– fact tables
– dimension tables
21
Dimensional Modeling
• Fact tables
– Contain the metrics resulting from a business process
or measurement event, such as the sales ordering
process or service call event
– Dimensional models should be structured around
business processes and their associated data
sources,
• This results in ability to design identical, consistent views of
data for all observers, regardless of which business unit they
belong to, which goes a long way toward eliminating
misunderstandings at business meetings
– Fact table’s granularity should be set at the lowest,
most atomic level captured by the business process
• This allows for maximum flexibility and extensibility.
– Business users will be able to ask constantly changing,
free-ranging, and very precise questions.
22
Dimensional Modeling
• Dimensional table
– Contain the descriptive attributes and characteristics
associated with specific, tangible measurement
events, such as the customer, product, or sales
representative associated with an order being
placed.
– Dimension attributes are used for constraining,
grouping, or labeling in a query.
– Hierarchical many-to-one relationships are
denormalized into single dimension tables.
23
Star Schema
• A fact table
• Multiple dimension tables
• Example: Assume this schema to be of a retail-chain. Fact will
be revenue (money). How do you want to see data is called a
dimension.
24
Snowflake Schema
• The snowflake schema is a variation of the star schema used in a data
warehouse.
• The snowflake schema is a more complex schema than the star schema
because the tables which describe the dimensions are normalized.
25
Snowflake Schema
• Disadvantages:
– Fact tables are typically responsible for 90% or more of the
storage requirements, so the benefit is normally insignificant.
– Normalization of the dimension tables ("snowflaking") can impair
the performance of a data warehouse.
• Advantages:
– If a dimension is very sparse (i.e. most of the possible values for
the dimension have no data) and/or a dimension has a very long
list of attributes which may be used in a query, the dimension
table may occupy a significant proportion of the database and
snowflaking may be appropriate.
• In practice, many data warehouses will normalize some
dimensions and not others, and hence use a
combination of snowflake and classic star schema.
26
Physical Design
• Defining the physical structures
– setting up the database environment
– Setting up appropriate security
– preliminary performance tuning strategies,
from indexing to partitioning and aggregations.
– If appropriate, OLAP databases are also
designed during this process.
27
ETL Design and Development
• The MOST important stage
• 70% of the risk and effort in the DW
project is attributed to this stage
• ETL system capabilities:
– Extraction
– Cleansing and conforming
– Delivery and management
28
ETL
• Raw data is extracted from the operational
source systems and is being transformed into
meaningful information for the business
• ETL processes must be architected long before
any data is extracted from the source
• ETL system strives to deliver high throughput, as
well as high quality output
• Incoming data is checked for reasonable quality
• Data quality conditions are continuously
monitored
• Kimball calls ETL a “data warehouse back room”
29
Business Intelligence
Application Track
• Applications that query, analyze, and present information
from the dimensional model.
• BI applications deliver business value from the DW/BI
solution, rather than just delivering the data
• The goal is to deliver capabilities that are accepted by
the business to support and enhance their decision
making.
• BI Application Design
– Identify the candidate BI applications and appropriate navigation
interfaces to address the users’ needs and needed capabilities.
– Produce BI application specification
• BI Application Development
– Configuration of the business metadata and tool infrastructure
– Construction and validation of the specified analytic and
operational BI applications and the navigational portal
30
Deployment
• It is crucial that adequate planning was
performed to make sure that:
– the results of technology, data, and BI application
tracks are tested and fit together properly
– Appropriate education and support infrastructure is in
place.
• It is critical that deployment be well orchestrated
• Deployment should be deferred if all the pieces,
such as training, documentation, and validated
data, are not ready for production release.
31
Maintenance
• Occurs when the system is in production
• Includes:
– technical operational tasks that are necessary
to keep the system performing optimally
• usage monitoring
• performance tuning
• index maintenance
• system backup
– Ongoing support, education, and
communication with business users
32
Growth
• DW systems tend to expand (if they were
successful)
– Is considered as a sign of success
– New requests need to be prioritized
– Starting the cycle again
• Building upon the foundation that has already been
established
• Focusing on the new requirements
33