Data Warehousing Fundamentals



Syllabus
Introduction to Data Warehouse, Data warehouse architecture,
Data warehouse versus Data Marts, E-R Modeling versus
Dimensional Modeling, Information Package Diagram, Data
Warehouse Schemas; Star Schema, Snowflake Schema, Factless
Fact Table, Fact Constellation Schema. Update to the dimension
tables. Major steps in ETL process, OLTP versus OLAP, OLAP
operations: Slice, Dice, Rollup, Drilldown and Pivot.
Introduction to Data Warehouse
• Data Warehouse is a relational database management
system (RDBMS) construct built to support query, analysis,
and decision-making rather than transaction processing.
• It can be loosely described as any centralized data repository
which can be queried for business benefits.
• It is a database that stores information oriented to satisfy
decision-making requests.
• It is a group of decision support technologies that aims to
enable the knowledge worker (executive, manager, and
analyst) to make better and faster decisions.
• So, Data Warehousing provides architectures and tools for
business executives to systematically organize, understand,
and use their information to make strategic decisions.
What is a Data Warehouse?
• A Data Warehouse (DW) is a relational
database that is designed for query and
analysis rather than transaction processing.
• It includes historical data derived from
transaction data from single or multiple
sources.
• A Data Warehouse provides integrated,
enterprise-wide, historical data and focuses on
providing support for decision-makers for data
modeling and analysis.
What is a Data Warehouse?
• A Data Warehouse is a group of data specific to the entire organization, not
only to a particular group of users.
• It is not used for daily operations and transaction processing but used for
making decisions.
• A Data Warehouse can be viewed as a data system with the following
attributes:
• It is a database designed for investigative tasks, using data from various
applications.
• It supports a relatively small number of clients with relatively long
interactions.
• It includes current and historical data to provide a historical perspective of
information.
• Its usage is read-intensive.
• It contains a few large tables.
• "A Data Warehouse is a subject-oriented, integrated, time-variant,
and non-volatile collection of data in support of management's
decision-making process."
Characteristics of Data Warehouse

• Subject-Oriented
• A data warehouse targets the modeling and
analysis of data for decision-makers.
• Therefore, data warehouses typically provide a
concise and straightforward view around a
particular subject, such as customer, product, or
sales, instead of the global organization's ongoing
operations.
• This is done by excluding data that are not useful
concerning the subject and including all data
needed by the users to understand the subject.
Characteristics of Data Warehouse

• Integrated
• A data warehouse integrates various
heterogeneous data sources like RDBMS, flat
files, and online transaction records.
• It requires performing data cleaning and
integration during data warehousing to ensure
consistency in naming conventions, attribute
types, etc., among different data sources.
Characteristics of Data Warehouse
• Time-Variant
• Historical information is kept in a data
warehouse. For example, one can retrieve data
from 3 months, 6 months, or 12 months ago, or
even older, from a data warehouse.
• This differs from a transaction system, where
often only the most current data is kept.
Characteristics of Data Warehouse
• Non-Volatile
• The data warehouse is a physically separate data storage,
which is transformed from the source operational RDBMS.
• The operational updates of data do not occur in the data
warehouse, i.e., update, insert, and delete operations are not
performed.
• It usually requires only two procedures in data accessing:
Initial loading of data and access to data. Therefore, the DW
does not require transaction processing, recovery, and
concurrency capabilities, which allows for substantial
speedup of data retrieval.
• Non-volatile means that, once entered into the
warehouse, data should not change.
Need for Data Warehouse
• Data Warehouse is needed for the following reasons:
• 1) Business User: Business users require a data warehouse to view
summarized data from the past. Since these people are non-technical, the
data may be presented to them in an elementary form.
• 2) Store historical data: Data Warehouse is required to store the time
variable data from the past. This input is made to be used for various
purposes.
• 3) Make strategic decisions: Some strategies depend upon the
data in the data warehouse. So, the data warehouse contributes to
making strategic decisions.
• 4) For data consistency and quality: By bringing the data from
different sources to a common place, the user can effectively ensure
uniformity and consistency in the data.
• 5) Fast response time: The data warehouse has to be ready for
somewhat unexpected loads and types of queries, which demands a
significant degree of flexibility and a quick response time.
Benefits of Data Warehouse

• Better business analytics: A data warehouse plays an important
role in every business by storing and analyzing all of the
company's past data and records, which further improves the
company's understanding and analysis of its data.
• Faster Queries: A data warehouse is designed to handle large
queries, which is why it runs queries faster than an operational database.
• Improved data quality: The data gathered from different sources
is stored and analyzed in the warehouse without being altered or
added to, so data quality is maintained; any data-quality issues
that arise are resolved by the data warehouse team.
• Historical Insight: The warehouse stores all your historical
data, which contains details about the business, so that one
can analyze it at any time and extract insights from it.
Difference between Database and Data Warehouse
• Database System: A database system is used in the traditional way of
storing and retrieving data.
• The major task of a database system is to perform query processing.
These systems are generally referred to as online transaction processing
(OLTP) systems.
• These systems are used for the day-to-day operations of any organization.
• Data Warehouse: Data Warehouse is the place where huge amount
of data is stored.
• It is meant for users or knowledge workers in the role of data analysis
and decision making.
• These systems are supposed to organize and present data in different
formats and forms in order to serve the needs of specific users for
specific purposes.
• These systems are referred to as online analytical processing (OLAP)
systems.
Difference between Database and Data Warehouse
Example Applications of Data Warehousing

• Data Warehousing can be applied anywhere we have a huge
amount of data and want to see statistical results that help in
decision making.
• Social Media Websites: The social networking websites like
Facebook, Twitter, Linkedin, etc. are based on analyzing large data
sets. These sites gather data related to members, groups, locations,
etc., and store it in a single central repository. Given the large
amount of data, a data warehouse is needed to implement this.
• Banking: Most of the banks these days use warehouses to see the
spending patterns of account/cardholders. They use this to provide
them with special offers, deals, etc.
• Government: Governments use a data warehouse to store and
analyze tax payments, which is used to detect tax fraud.
• Advantages of Data Warehouse:
• Improved data quality: Data warehousing can help improve
data quality by consolidating data from various sources into a
single, consistent view.
• Faster access to information: Data warehousing enables
quick access to information, allowing businesses to make
better, more informed decisions faster.
• Better decision-making: With a data warehouse, businesses
can analyze data and gain insights into trends and patterns
that can inform better decision-making.
• Reduced data redundancy: By consolidating data from
various sources, data warehousing can reduce data
redundancy and inconsistencies.
• Scalability: Data warehousing is highly scalable and can
handle large amounts of data from different sources.
• Disadvantages:
• Cost: Building a data warehouse can be expensive, requiring
significant investments in hardware, software, and personnel.
• Complexity: Data warehousing can be complex, and businesses
may need to hire specialized personnel to manage the system.
• Time-consuming: Building a data warehouse can take a
significant amount of time, requiring businesses to be patient
and committed to the process.
• Data integration challenges: Data from different sources can
be challenging to integrate, requiring significant effort to
ensure consistency and accuracy.
• Data security: Data warehousing can pose data security risks,
and businesses must take measures to protect sensitive data
from unauthorized access or breaches.
Components of Data warehouse

• The Source Data component shows various internal & external data
sources such as Production Data, Internal Data, Archived Data, External
Data.
• The extracted data coming from several different sources needs to be
changed, converted, and made ready in a format that is suitable to be
saved for querying and analysis. The Data Staging element serves as the
next building block and performs all these tasks.
• In the middle, we see the Data Storage component that handles the data
warehouse's data. This element not only stores and manages the data; it
also keeps track of the data using the metadata repository.
• The Information Delivery component, shown on the right, consists of all
the different ways of making the information from the data warehouse
available to the users.
• The management and control elements coordinate the services and
functions within the data warehouse. These components control the
data transformation and the data transfer into the data warehouse
storage.
Data warehouse architecture

• A data warehouse architecture is a method of defining
the overall architecture of data communication,
processing, and presentation that exists for end-client
computing within the enterprise.
• Each data warehouse is different, but all are
characterized by standard vital components.
• Data warehouses and their architectures vary depending
upon the elements of an organization's situation.
• Three common architectures are:
• Data Warehouse Architecture: Basic
• Data Warehouse Architecture: With Staging Area
• Data Warehouse Architecture: With Staging Area and
Data Marts
Properties of Data Warehouse Architectures

• 1. Separation: Analytical and transactional processing
should be kept apart as much as possible.
• 2. Scalability: Hardware and software architectures should
be simple to upgrade as the data volume, which has to be
managed and processed, and the users' requirements, which
have to be met, progressively increase.
• 3. Extensibility: The architecture should be able to
accommodate new operations and technologies without
redesigning the whole system.
• 4. Security: Monitoring accesses is necessary because of
the strategic data stored in the data warehouse.
• 5. Administerability: Data warehouse management should
not be complicated.
Types of Data Warehouse Architectures
• Single-tier architecture is not frequently used in
practice. Its purpose is to minimize the amount of data
stored; to reach this goal, it removes data redundancies.
• Two-tier architecture highlights a separation between
physically available sources and the data warehouse; in
fact, it consists of four subsequent data flow stages:
source layer, data staging, data warehouse layer, and analysis.
• The three-tier architecture consists of the source layer
(containing multiple source systems), the reconciled
layer, and the data warehouse layer (containing both
data warehouses and data marts). The reconciled layer
sits between the source data and the data warehouse.
Single-Tier Architecture
Two-Tier Architecture
Three-Tier Architecture
Three-Tier Data Warehouse Architecture
• Data Warehouses usually have a three-level (tier)
architecture that includes:
• Bottom Tier (Data Warehouse Server)
• Middle Tier (OLAP Server)
• Top Tier (Front end Tools).
• A bottom-tier that consists of the Data Warehouse
server, which is almost always an RDBMS. It may include
several specialized data marts and a metadata repository.
• Data from operational databases and external sources
(such as user profile data provided by external
consultants) are extracted using application program
interfaces called a gateway. A gateway is provided by the
underlying DBMS and allows client programs to
generate SQL code to be executed at a server.
Three-Tier Data Warehouse Architecture

• A middle-tier which consists of an OLAP server for fast
querying of the data warehouse.
• The OLAP server is implemented using either
• (1) A Relational OLAP (ROLAP) model, i.e., an extended
relational DBMS that maps functions on multidimensional
data to standard relational operations.
• (2) A Multidimensional OLAP (MOLAP) model, i.e., a
special-purpose server that directly implements
multidimensional data and operations.
• A top-tier that contains front-end tools for displaying
results provided by OLAP, as well as additional tools for
data mining of the OLAP-generated data.
Data warehouse versus Data Marts
What is Data Mart?

• Data Marts are analytical record stores designed to
focus on particular business functions for a specific
community within an organization.
• Data marts are derived from subsets of data in a data
warehouse, though in the bottom-up data warehouse
design methodology, the data warehouse is created
from the union of organizational data marts.
• The fundamental use of a data mart is for Business
Intelligence (BI) applications. BI is used to gather, store,
access, and analyze records.
• It can be used by smaller businesses to utilize the data
they have accumulated since it is less expensive than
implementing a data warehouse.
Types of Data Marts
• There are mainly two approaches to designing data marts. These
approaches are
• Dependent Data Marts (Top-down Approach):
• According to this technique, the data marts are treated as
subsets of a data warehouse.
• In this technique, firstly a data warehouse is created, from which
various data marts can then be created. These data marts are
dependent on the data warehouse and extract the essential records
from it.
• Independent Data Marts (Bottom-up Approach):
• Firstly, independent data marts are created, and then a data
warehouse is designed using these multiple independent data
marts. In this approach, as all the data marts are designed
independently, the integration of the data marts is required.
Data Warehouse Design
• A data warehouse is a single data repository where records from
multiple data sources are integrated for online analytical
processing (OLAP).
• This implies a data warehouse needs to meet the requirements from all
the business stages within the entire organization.
• Thus, data warehouse design is a hugely complex, lengthy, and hence
error-prone process. Furthermore, business analytical functions change
over time, which results in changes in the requirements for the systems.
• Therefore, data warehouse and OLAP systems are dynamic, and the
design process is continuous.
• The target of the design becomes how the records from multiple data
sources should be extracted, transformed, and loaded (ETL) to be
organized in a database as the data warehouse.
• There are two approaches
• "top-down" approach
• "bottom-up" approach
Top-down Design Approach

• In the "Top-Down" design approach, a data warehouse is
described as a subject-oriented, time-variant, non-volatile, and
integrated data repository for the entire enterprise. Data from
different sources are validated, reformatted, and saved in a
normalized (up to 3NF) database as the data warehouse.
• The data warehouse stores "atomic" information, the data at
the lowest level of granularity, from where dimensional data
marts can be built by selecting the data required for specific
business subjects or particular departments.
• This is a data-driven approach, as the information is
gathered and integrated first, and then the business requirements
by subjects for building data marts are formulated.
• The advantage of this method is that it supports a single
integrated data source. Thus data marts built from it will have
consistency when they overlap.
Top-down Design Approach

• Advantages of top-down design
• Data Marts are loaded from the data
warehouses.
• Developing a new data mart from the data
warehouse is very easy.
• Disadvantages of top-down design
• This technique is inflexible to changing
departmental needs.
• The cost of implementing the project is high.
Bottom-Up Design Approach
• In this approach, a data mart is created first to provide the necessary
reporting and analytical capabilities for particular business processes
(or subjects). Thus it is a business-driven approach, in contrast to
the data-driven approach.
• Data marts include the lowest-grain data and, if needed,
aggregated data too. Instead of a normalized database for the data
warehouse, a denormalized dimensional database is adopted to
meet the data delivery requirements of data warehouses.
• Using this method, to use the set of data marts as the enterprise
data warehouse, data marts should be built with conformed
dimensions in mind, meaning that common objects are represented
the same way in different data marts.
• The conformed dimensions connect the data marts to form a
data warehouse, which is generally called a virtual data
warehouse.
Bottom-Up Design Approach

• Advantages of bottom-up design
• Documents can be generated quickly.
• The data warehouse can be extended to
accommodate new business units.
• It is just a matter of developing new data marts and then
integrating them with other data marts.
• Disadvantages of bottom-up design
• The locations of the data warehouse and the
data marts are reversed in the bottom-up
design approach.
Data Warehouse Modeling

• Data warehouse modeling is the process of designing the
schemas of the detailed and summarized information of the
data warehouse.
• The goal of data warehouse modeling is to develop a schema
describing reality, or at least the part of it that the
data warehouse is required to support.
• Data warehouse modeling is an essential stage of building a
data warehouse for two main reasons. Firstly, through the
schema, data warehouse clients can visualize the relationships
among the warehouse data, to use them with greater ease.
• Secondly, a well-designed schema allows an effective data
warehouse structure to emerge, to help decrease the cost of
implementing the warehouse and improve the efficiency of
using it.
Data Modeling Life Cycle
• It is a straightforward process of transforming the business
requirements to fulfill the goals for storing, maintaining, and
accessing the data within IT systems. The result is a logical and
physical data model for an enterprise data warehouse.
• The objective of the data modeling life cycle is primarily the
creation of a storage area for business information. That area
comes from the logical and physical data modeling stages, as shown
in Figure:
E-R Modeling versus Dimensional Modeling
• ER model is used for logical representation or the
conceptual view of data.
• It is a high level of the conceptual data model.
• It forms a virtual representation of data that describes
how all the data are related to each other.
• It is a complex diagram that is used to represent multiple
processes.
• It helps to describe entities, attributes, and relationships.
• It helps to analyze data requirements systematically to
produce a well-designed database.
• At the view level, the ER model is considered a good
option for designing databases.
E-R Modeling versus Dimensional Modeling
• Data in a warehouse are usually in the multidimensional form.
• Dimensional Modeling prefers keeping the table denormalized.
• The primary purpose of dimensional modeling is to optimize the
database for faster retrieval of the data.
• The concept of Dimensional Modelling was developed by Ralph
Kimball and consists of “fact” and “dimension” tables. The primary
purpose of dimensional modeling is to enable business intelligence
(BI) reporting, query, and analysis.
• Dimensional modeling is a form of modeling of data that is more
flexible from the perspective of the user. These dimensional and
relational models have their unique way of data storage that has
specific advantages.
• Dimensional models are built around business processes. They
need to ensure that dimension tables use a surrogate key.
• Dimension tables store the history of the dimensional information.
Difference between ER Modeling and Dimensional Modeling
Information Package diagram
• It is a novel idea for determining and recording information
requirements for a data warehouse.
• This concept helps us to give a concrete form to the various
insights, nebulous thoughts, and opinions expressed during the
process of collecting requirements.
• The information packages are very useful for taking the
development of the data warehouse to the next phases.
• When requirements cannot be fully determined, we need a new
and innovative concept to gather and record the requirements.
• The new methodology for determining requirements for a data
warehouse system is based on business dimensions.
• It flows out of the need of the users to base their analysis on
business dimensions. The new concept incorporates the basic
measurements and the business dimensions along which the users
analyze these basic measurements.
Information Package Diagram
• The measured facts or the measurements that are of interest for analysis
are shown in the bottom section of the package diagram.
• In this case, the measurements are actual sales, forecast sales, and budget
sales.
• The business dimensions along which these measurements are to be
analyzed are shown at the top of the diagram as column headings.
• In this example, these dimensions are time, location, product, and
demographic age group. Each of these business dimensions contains a
hierarchy or levels. For example, the time dimension has the hierarchy
going from year down to the level of individual day. The other intermediary
levels in the time dimension could be quarter, month, and week.
• These levels or hierarchical components are shown in the information
package diagram.
• Your primary goal in the requirements definition phase is to compile
information packages for all the subjects for the data warehouse. Once you
have firmed up the information packages, you’ll be able to proceed to the
other phases.
Information Package Diagram

• Information packages enable you to:
• Define the common subject areas
• Design key business metrics
• Decide how data must be presented
• Determine how users will aggregate or roll up
• Decide the data quantity for user analysis or query
• Decide how data will be accessed
• Establish data granularity
• Estimate data warehouse size
• Determine the frequency for data refreshing
• Ascertain how information must be packaged
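• The structure of an information package can be sketched as a simple data structure. The following Python sketch is illustrative only: the subject name, hierarchy levels, and measure names are assumptions modeled on the sales example above, not a prescribed format.

```python
# A minimal sketch of an information package for a hypothetical "Sales" subject.
# "dimensions" maps each business dimension to its hierarchy (top level first);
# "measures" lists the facts to be analyzed along those dimensions.
sales_package = {
    "subject": "Sales",
    "dimensions": {
        "time": ["year", "quarter", "month", "week", "day"],
        "location": ["country", "state", "city"],
        "product": ["category", "brand", "item"],
        "age_group": ["group"],
    },
    "measures": ["actual_sales", "forecast_sales", "budget_sales"],
}

# The package also tells us the grain of the eventual fact table: one row per
# combination of the lowest hierarchy levels of each dimension.
grain = [levels[-1] for levels in sales_package["dimensions"].values()]
print(grain)  # ['day', 'city', 'item', 'group']
```

Writing the package down this way makes the grain, hierarchies, and measures explicit before any schema design begins.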
Data Warehousing - Schemas
• Schema is a logical description of the entire
database.
• It includes the name and description of
records of all record types including all
associated data-items and aggregates.
• Much like a database, a data warehouse also
requires a schema to be maintained.
• A database uses the relational model, while a data
warehouse uses the Star, Snowflake, and Fact
Constellation schemas.
Data Warehouse Schemas

• Star and snowflake schema designs are
mechanisms to separate facts and dimensions into
separate tables.
• Snowflake schemas further separate the different
levels of a hierarchy into separate tables.
• In either schema design, each table is related to
another table with a primary key/foreign key
relationship.
• Primary key/foreign key relationships are used in
relational databases to define many-to-one
relationships between tables.
What is Star Schema?

• A star schema is the elementary form of a dimensional model, in
which data are organized into facts and dimensions.
• A fact is an event that is counted or measured, such as a sale or
a login. A dimension includes reference data about the fact, such
as date, item, or customer.
• A star schema is a relational schema whose design represents a
multidimensional data model. The star schema is the simplest
data warehouse schema.
• It is known as a star schema because the entity-relationship
diagram of this schema resembles a star, with points diverging
from a central table. The center of the schema consists of a
large fact table, and the points of the star are the dimension
tables.
Star Schema
• The following diagram shows the sales data of a
company with respect to the four dimensions,
namely time, item, branch, and location.
Fact Tables
• It is a table in a star schema that contains facts and is
connected to the dimensions.
• A fact table has two types of columns: those that include
fact and those that are foreign keys to the dimension table.
• The primary key of the fact tables is generally a composite
key that is made up of all of its foreign keys.
• A fact table may contain either detail-level facts or facts
that have been aggregated (fact tables that contain
aggregated facts are often instead called summary tables).
• A fact table generally contains facts with the same level of
aggregation.
Dimension Tables
• A dimension is a structure usually composed of one
or more hierarchies that categorize data. If a dimension
does not have hierarchies and levels, it is called a flat
dimension or list.
• The primary keys of each of the dimensions table are
part of the composite primary keys of the fact table.
• Dimensional attributes help to define the dimensional
value. They are generally descriptive, textual values.
Dimension tables are usually smaller in size than fact
tables.
• Fact tables store data about sales, while dimension tables store
data about the geographic region (markets, cities),
clients, products, times, and channels.
Characteristics of Star Schema

• The star schema is highly suitable for data
warehouse database design because of the following
features:
• It creates a denormalized database that can quickly
provide query responses.
• It provides a flexible design that can be changed easily
or added to throughout the development cycle, and
as the database grows.
• It parallels, in its design, how end-users
typically think of and use the data.
• It reduces the complexity of metadata for both
developers and end-users.
Star Schema
• Advantages of Star Schema
• Star schemas are easy for end-users and applications to
understand and navigate. With a well-designed schema, the
customer can instantly analyze large, multidimensional data sets.
• The main advantages of star schemas in a decision-support
environment are:
• Query Performance
• Load performance and administration
• Built-in referential integrity
• Easily Understood
• Disadvantage of Star Schema
• There are some conditions that cannot be met by star schemas;
for example, the relationship between a user and a bank account
cannot be described as a star schema, because the relationship
between them is many-to-many.
Advantages of Star Schema

• Query Performance
• Because a star schema database has a limited number of tables
and clear join paths, queries run faster than they do
against OLTP systems.
• Small single-table queries, frequently of a dimension
table, are almost instantaneous. Large join queries that
contain multiple tables take only seconds or minutes
to run.
• In a star schema database design, the dimensions are
connected only through the central fact table. When
two dimension tables are used in a query, only one
join path, intersecting the fact table, exists between
those two tables. This design feature enforces accurate
and consistent query results.
Advantages of Star Schema

• Load performance and administration
• Structural simplicity also decreases the time
required to load large batches of records into a star
schema database.
• By defining facts and dimensions and separating
them into various tables, the impact of a load on the
structure is reduced.
• Dimension tables can be populated once and
occasionally refreshed.
• We can add new facts regularly and selectively by
appending records to a fact table.
Advantages of Star Schema

• Built-in referential integrity
• A star schema has referential integrity built-in
when information is loaded.
• Referential integrity is enforced because each
record in the dimension tables has a unique
primary key, and all keys in the fact table are
legitimate foreign keys drawn from the
dimension tables.
• A record in the fact table which is not related
correctly to a dimension cannot be given the
correct key value to be retrieved.
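• This built-in referential integrity can be demonstrated with a small, self-contained sketch (SQLite driven from Python; the table names, column names, and sample rows are illustrative assumptions, not from the slides). An insert into the fact table with a key that has no matching dimension row is rejected.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

con.execute("CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT)")
con.execute("""
    CREATE TABLE fact_sales (
        product_key INTEGER NOT NULL REFERENCES dim_product(product_key),
        amount REAL
    )""")

con.execute("INSERT INTO dim_product VALUES (1, 'Widget')")
con.execute("INSERT INTO fact_sales VALUES (1, 9.99)")      # valid: key 1 exists

caught = None
try:
    con.execute("INSERT INTO fact_sales VALUES (2, 5.00)")  # key 2 does not exist
except sqlite3.IntegrityError as exc:
    caught = exc            # the orphan fact row is rejected by the database
    print("rejected:", exc)
```

The same guarantee is what the slide describes: a fact row that is not related correctly to a dimension cannot enter the warehouse.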
Advantages of Star Schema

• Easily Understood
• A star schema is simple to understand and
navigate, with dimensions joined only through
the fact table.
• These joins are more significant to the end-
user because they represent the fundamental
relationship between parts of the underlying
business.
• Customers can also browse dimension table
attributes before constructing a query.
Star Schema
• Example: Suppose a star schema is composed of a
fact table, SALES, and several dimension tables
connected to it for time, branch, item, and
geographic locations.
• The TIME table has columns for day, month,
quarter, and year. The ITEM table has columns for
each item_Key, item_name, brand, type,
supplier_type. The BRANCH table has columns for
each branch_key, branch_name, branch_type.
The LOCATION table has columns of geographic
data, including street, city, state, and country.
Star Schema
• In this scenario, the SALES table contains only four
columns with IDs from the dimension tables, TIME, ITEM,
BRANCH, and LOCATION, instead of four columns for
time data, four columns for ITEM data, three columns for
BRANCH data, and four columns for LOCATION data.
Thus, the size of the fact table is significantly reduced.
When we need to change an item, we need only make a
single change in the dimension table, instead of making
many changes in the fact table.
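• The SALES star schema described above can be sketched as runnable SQL (here driven from Python with SQLite). The dimension columns follow the slides; the measure column units_sold and the sample rows are illustrative assumptions.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE time     (time_key INTEGER PRIMARY KEY, day TEXT, month TEXT,
                           quarter TEXT, year INTEGER);
    CREATE TABLE item     (item_key INTEGER PRIMARY KEY, item_name TEXT,
                           brand TEXT, type TEXT, supplier_type TEXT);
    CREATE TABLE branch   (branch_key INTEGER PRIMARY KEY, branch_name TEXT,
                           branch_type TEXT);
    CREATE TABLE location (location_key INTEGER PRIMARY KEY, street TEXT,
                           city TEXT, state TEXT, country TEXT);

    -- Fact table: only foreign keys to the dimensions plus the measure(s).
    CREATE TABLE sales (
        time_key     INTEGER REFERENCES time(time_key),
        item_key     INTEGER REFERENCES item(item_key),
        branch_key   INTEGER REFERENCES branch(branch_key),
        location_key INTEGER REFERENCES location(location_key),
        units_sold   INTEGER
    );
""")

con.execute("INSERT INTO time VALUES (1, '2024-01-15', 'Jan', 'Q1', 2024)")
con.execute("INSERT INTO item VALUES (1, 'Laptop', 'Acme', 'Electronics', 'OEM')")
con.execute("INSERT INTO branch VALUES (1, 'Downtown', 'Retail')")
con.execute("INSERT INTO location VALUES (1, '5th Ave', 'New York', 'NY', 'USA')")
con.execute("INSERT INTO sales VALUES (1, 1, 1, 1, 40)")

# A typical star join: total units sold per brand and city. Each dimension
# joins to the fact table through exactly one key.
rows = con.execute("""
    SELECT i.brand, l.city, SUM(s.units_sold)
    FROM sales s
    JOIN item i     ON s.item_key = i.item_key
    JOIN location l ON s.location_key = l.location_key
    GROUP BY i.brand, l.city
""").fetchall()
print(rows)  # [('Acme', 'New York', 40)]
```

Note how the fact table stores only keys and measures, exactly as described above: changing an item's description touches one dimension row, never the fact rows.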
• We can create even more complex star schemas by
normalizing a dimension table into several tables. The
normalized dimension table is called a Snowflake.
What is Snowflake Schema?
• A snowflake schema is a variant of the star schema. "A schema is known as a
snowflake if one or more dimension tables do not connect directly to the fact
table but must join through other dimension tables."
• The snowflake schema is an expansion of the star schema where each point of the
star explodes into more points.
• It is called a snowflake schema because the diagram of a snowflake schema
resembles a snowflake.
• Snowflaking is a method of normalizing the dimension tables in a star schema.
When we normalize all the dimension tables entirely, the resultant structure
resembles a snowflake with the fact table in the middle.
• Snowflaking is used to improve the performance of specific queries. The schema is
diagrammed with each fact surrounded by its associated dimensions, and those
dimensions are related to other dimensions, branching out into a snowflake
pattern.
• The snowflake schema consists of one fact table which is linked to many
dimension tables, which can be linked to other dimension tables through a
many-to-one relationship. Tables in a snowflake schema are generally normalized
to the third normal form. Each dimension table represents exactly one level in a
hierarchy.
Snowflake Schema
• The following diagram shows a snowflake schema with two
dimensions, each having three levels. A snowflake schema
can have any number of dimensions, and each dimension can
have any number of levels.
Snowflake Schema
• Example: Unlike the star schema, the dimension tables in a snowflake
schema are normalized. For example, the item dimension table in the star
schema is normalized and split into two dimension tables, namely the item
and supplier tables.
• Now the item dimension table contains the attributes item_key,
item_name, type, brand, and supplier_key.
• The supplier key is linked to the supplier dimension table. The supplier
dimension table contains the attributes supplier_key and supplier_type.
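This snowflaking of the item dimension can be sketched in SQLite; the table names follow the example above, while the data values are illustrative:

```python
import sqlite3

# Snowflaked item dimension: supplier attributes move to their own table,
# and item carries only a supplier_key pointing at it.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE supplier (supplier_key INTEGER PRIMARY KEY, supplier_type TEXT);
CREATE TABLE item     (item_key INTEGER PRIMARY KEY, item_name TEXT,
                       type TEXT, brand TEXT,
                       supplier_key INTEGER REFERENCES supplier(supplier_key));
""")
con.execute("INSERT INTO supplier VALUES (100, 'Wholesale')")
con.execute("INSERT INTO item VALUES (10, 'Mobile', 'Electronics', 'BrandA', 100)")

# Resolving supplier_type now needs one extra join compared with a star schema.
row = con.execute("""
    SELECT i.item_name, s.supplier_type
    FROM item i JOIN supplier s USING (supplier_key)
""").fetchone()
print(row)  # ('Mobile', 'Wholesale')
```

The extra join is the trade-off snowflaking makes for removing redundant supplier data from the item table.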
Snowflake Schema
• Advantage of Snowflake Schema
• The primary advantage of the snowflake schema is the
improvement in query performance for some queries, due to minimized
disk storage requirements and joins against smaller lookup tables.
• It provides greater scalability in the interrelationship between
dimension levels and components.
• No redundancy, so it is easier to maintain.
• Disadvantage of Snowflake Schema
• The primary disadvantage of the snowflake schema is the
additional maintenance effort required due to the increasing
number of lookup tables.
• Queries become more complex and hence more difficult to
understand.
• More tables mean more joins, and therefore longer query
execution time.
Factless Fact Table
• A regular fact table contains facts or measures.
• A factless fact table contains no measures or facts; it is used only to
establish relationships between elements of different dimensions.
Factless Fact Table
• There are two types of factless fact tables:
• 1. Event Tracking Tables:
• A factless fact table can be used to track events of
interest to the organization.
• 2. Coverage Tables:
• These are used to support negative analysis reports.
For example, to create a report that a store
did not sell a product for a certain period of
time, you need a table that captures
all possible combinations; you can then find
out what is missing.
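The coverage-table idea can be sketched as follows; `on_promotion` is a hypothetical factless table listing every (store, product) combination offered, and comparing it with the sales fact table reveals what was offered but never sold:

```python
import sqlite3

# Coverage-table sketch: on_promotion has no measures at all, only keys.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE on_promotion (store_key INTEGER, product_key INTEGER);  -- factless
CREATE TABLE sales        (store_key INTEGER, product_key INTEGER, units INTEGER);
""")
con.executemany("INSERT INTO on_promotion VALUES (?, ?)",
                [(1, 10), (1, 20), (2, 10)])
con.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                [(1, 10, 3), (2, 10, 1)])

# Negative analysis: combinations that appear in coverage but never sold.
unsold = con.execute("""
    SELECT p.store_key, p.product_key
    FROM on_promotion p
    LEFT JOIN sales s USING (store_key, product_key)
    WHERE s.units IS NULL
""").fetchall()
print(unsold)  # [(1, 20)]
```

Without the factless coverage table there would be no row anywhere recording that store 1 offered product 20, so the "did not sell" question could not be answered.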
Fact Constellation Schema.
• A Fact constellation means two or more fact
tables sharing one or more dimensions. It is
also called Galaxy schema.
• Fact Constellation Schema describes a logical
structure of data warehouse or data mart.
• A Fact Constellation Schema can be designed with a
collection of de-normalized fact tables and shared,
conformed dimension tables.
Fact Constellation Schema.
• This schema defines two fact tables, sales, and shipping.
• Sales are treated along four dimensions, namely, time,
item, branch, and location. The schema contains a fact
table for sales that includes keys to each of the four
dimensions, along with two measures: Rupee_sold and
units_sold.
• The shipping table has five dimensions, or keys: item_key,
time_key, shipper_key, from_location, and to_location,
and two measures: Rupee_cost and units_shipped.
• The primary disadvantage of the fact constellation schema
is that it is a more challenging design because many
variants for specific kinds of aggregation must be
considered and selected.
ETL process
• The mechanism of extracting information from source
systems and bringing it into the data warehouse is
commonly called ETL, which stands for Extraction,
Transformation and Loading.
• The ETL process requires active inputs from various
stakeholders, including developers, analysts, testers,
top executives and is technically challenging.
• To maintain its value as a tool for decision-makers,
a data warehouse must change as the business changes.
• ETL is a recurring method (daily, weekly, monthly) of a
Data warehouse system and needs to be agile,
automated, and well documented.
ETL process
How ETL Works?
• Extraction
• Extraction is the operation of extracting information
from a source system for further use in a data
warehouse environment. This is the first stage of the
ETL process.
• Extraction process is often one of the most time-
consuming tasks in the ETL.
• The source systems might be complicated and poorly
documented, and thus determining which data needs
to be extracted can be difficult.
• The data has to be extracted several times in a
periodic manner to supply all changed data to the
warehouse and keep it up-to-date.
How ETL Works?
• Cleansing
• The cleansing stage is crucial in a data warehouse pipeline because
it is supposed to improve data quality. The primary data cleansing
features found in ETL tools are rectification and homogenization. They
use specific dictionaries to rectify typing mistakes and to recognize
synonyms, as well as rule-based cleansing to enforce domain-specific
rules and define appropriate associations between values.
• The following examples show the essentials of data cleaning:
• If an enterprise wishes to contact its users or its suppliers, a
complete, accurate and up-to-date list of contact addresses, email
addresses and telephone numbers must be available.
• If a client or supplier calls, the staff responding should be able to
quickly find the person in the enterprise database, but this requires that
the caller's name or his/her company name is listed in the database.
• If a user appears in the database with two or more slightly different
names or different account numbers, it becomes difficult to update
the customer's information.
How ETL Works?
• Transformation
• Transformation is the core of the reconciliation phase. It converts
records from their operational source format into a particular data
warehouse format. If we implement a three-layer architecture, this
phase outputs our reconciled data layer.
• The following points must be rectified in this phase:
• Loose text may hide valuable information. For example, "XYZ PVT Ltd"
does not explicitly show that this is a private limited company.
• Different formats can be used for individual data items. For example, a
date can be saved as a string or as three integers.
• Following are the main transformation processes aimed at populating
the reconciled data layer:
• Conversion and normalization that operate on both storage formats
and units of measure to make data uniform.
• Matching that associates equivalent fields in different sources.
• Selection that reduces the number of source fields and records.
How ETL Works?
• In this step, a set of rules or functions is applied to the
extracted data to convert it into a single standard format.
• It may involve following processes/tasks:
• Filtering – loading only certain attributes into the data
warehouse.
• Cleaning – filling up the NULL values with some default
values, mapping U.S.A, United States, and America into USA,
etc.
• Joining – joining multiple attributes into one.
• Splitting – splitting a single attribute into multiple attributes.
• Sorting – sorting tuples on the basis of some attribute
(generally key-attribute).
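A toy transformation step covering the tasks listed above (cleaning NULLs, homogenizing country names, splitting one attribute, joining into one, and sorting on the key) might look like this; the record layout and field names are invented for illustration:

```python
# Minimal transformation sketch over a list of dict records.
raw = [
    {"id": 3, "country": "U.S.A", "name": "Lee, Ana", "amount": None},
    {"id": 1, "country": "United States", "name": "Roy, Ben", "amount": 40},
]

# Cleaning: map spelling variants onto one canonical value.
COUNTRY_MAP = {"U.S.A": "USA", "United States": "USA", "America": "USA"}

def transform(rec):
    rec = dict(rec)
    rec["amount"] = rec["amount"] if rec["amount"] is not None else 0  # default NULLs
    rec["country"] = COUNTRY_MAP.get(rec["country"], rec["country"])   # homogenize
    last, first = [p.strip() for p in rec.pop("name").split(",")]      # splitting
    rec["full_name"] = f"{first} {last}"                               # joining
    return rec

# Sorting on the key attribute after transforming every record.
clean = sorted((transform(r) for r in raw), key=lambda r: r["id"])
print(clean[0])  # {'id': 1, 'country': 'USA', 'amount': 40, 'full_name': 'Ben Roy'}
```

Real ETL tools express the same operations declaratively, but the mapping-and-default pattern is the same.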
How ETL Works?
• Cleansing and Transformation processes are
often closely linked in ETL tools.
How ETL Works?
• Loading
• Loading is the process of writing the data into the target
database. During the load step, it is necessary to ensure that the
load is performed correctly and with as few resources as
possible.
• Loading can be carried out in two ways:
• Refresh: Data warehouse data is completely rewritten, meaning
that the older data is replaced. Refresh is usually used in
combination with static extraction to populate a data warehouse
initially.
• Update: Only the changes applied to source information are
added to the data warehouse. An update is typically carried out
without deleting or modifying preexisting data. This method is
used in combination with incremental extraction to update data
warehouses regularly.
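The refresh and update strategies can be contrasted with a small SQLite sketch; the table name and rows are invented for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE dw_sales (sale_id INTEGER PRIMARY KEY, amount INTEGER)")

def refresh(rows):
    # Refresh: rewrite the table completely from a static extract.
    con.execute("DELETE FROM dw_sales")
    con.executemany("INSERT INTO dw_sales VALUES (?, ?)", rows)

def update(changed_rows):
    # Update: apply only the rows from an incremental extract,
    # overwriting a row if its key already exists.
    con.executemany("INSERT OR REPLACE INTO dw_sales VALUES (?, ?)", changed_rows)

refresh([(1, 100), (2, 200)])       # initial full load
update([(2, 250), (3, 300)])        # one modified row, one new row
result = con.execute("SELECT * FROM dw_sales ORDER BY sale_id").fetchall()
print(result)  # [(1, 100), (2, 250), (3, 300)]
```

A real warehouse update would also track deletions and load timestamps; the sketch shows only the core insert-or-overwrite behavior.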
How ETL Works?
• The ETL process can also use pipelining: as soon as some data
is extracted, it can be transformed, and during that period some new data can
be extracted. While the transformed data is being loaded into the data
warehouse, the already extracted data can be transformed. The block
diagram of the pipelined ETL process is shown below:
ETL Process
• ETL Tools: Most commonly used ETL tools are Hevo,
Sybase, Oracle Warehouse builder, CloverETL, and
MarkLogic.
• Data Warehouses: Most commonly used Data
Warehouses are Snowflake, Redshift, BigQuery, and
Firebolt.
• ETL process is an essential process in data warehousing
that helps to ensure that the data in the data warehouse
is accurate, complete, and up-to-date. However, it also
comes with its own set of challenges and limitations, and
organizations need to carefully consider the costs and
benefits before implementing them.
What is OLAP (Online Analytical Processing)?
• OLAP stands for On-Line Analytical Processing.
• OLAP is a category of software technology that enables
analysts, managers, and executives to gain insight into
information through fast, consistent, interactive access to a wide
variety of possible views of data that has been transformed
from raw information to reflect the real dimensionality of the
enterprise as understood by the users.
• OLAP implements multidimensional analysis of business
information and supports the capability for complex estimations,
trend analysis, and sophisticated data modeling.
What is OLAP (Online Analytical Processing)?
• It is rapidly becoming the essential foundation
for intelligent solutions, including Business
Performance Management, Planning,
Budgeting, Forecasting, Financial Reporting,
Analysis, Simulation Models, Knowledge
Discovery, and Data Warehouse Reporting.
• OLAP enables end users to perform ad hoc
analysis of data in multiple dimensions,
providing the insight and understanding they
require for better decision making.
OLAP
• OLAP cubes have two main purposes. The first is to
provide business users with a data model more intuitive
to them than a tabular model. This model is called a
Dimensional Model.
• The second purpose is to enable fast query response that
is usually difficult to achieve using tabular models.
• How OLAP Works?
• Fundamentally, OLAP has a very simple concept. It pre-
calculates most of the queries that are typically very hard
to execute over tabular databases, namely aggregation,
joining, and grouping. These queries are calculated
during a process that is usually called 'building' or
'processing' of the OLAP cube.
Characteristics of OLAP
Difference between OLTP and OLAP
• OLTP (On-Line Transaction Processing) is characterized by a large number of
short on-line transactions (INSERT, UPDATE, and DELETE).
• The primary emphasis of OLTP operations is on very rapid query
processing, maintaining record integrity in multi-access environments, and
effectiveness measured by the number of transactions per second.
• An OLTP database holds accurate and current records, and the
schema used to store the transactional database is the entity model (usually
3NF).
• OLAP (On-Line Analytical Processing) is characterized by a relatively low
volume of transactions. Queries are often complex and involve
aggregations.
• For OLAP systems, response time is the effectiveness measure. OLAP
applications are widely used in data mining.
• An OLAP database contains aggregated, historical information, stored in
multi-dimensional schemas (generally a star schema).
OLTP vs OLAP
• 1) Users: OLTP systems are designed for office workers, while
OLAP systems are designed for decision-makers.
Therefore, while an OLTP system may be accessed by
hundreds or even thousands of clients in a large enterprise,
an OLAP system is likely to be accessed only by a select
class of managers and may be used by only dozens of users.
• 2) Functions: OLTP systems are mission-critical. They
support the day-to-day operations of an enterprise and are
largely performance and availability driven. These
operations are simple and repetitive. OLAP systems are
management-critical; they support enterprise decision
tasks using detailed investigation.
OLTP vs OLAP
• 3) Nature: Although SQL queries return a set of data, OLTP
systems are designed to process one record at a time, for
example, a record related to a customer who may be on the phone
or in the store.
• OLAP systems are not designed to deal with individual customer
records. Instead, they involve queries that deal with many
records at a time and provide summary or aggregate information
to a manager. OLAP applications involve data stored in a data
warehouse that has been extracted from many tables and
possibly from more than one enterprise database.
• 4) Design: OLTP database operations are designed to be
application-oriented, while OLAP operations are designed to
be subject-oriented. OLTP systems view the enterprise record
as a collection of tables (possibly based on an entity-
relationship model). OLAP operations view enterprise
information as multidimensional.
OLTP vs OLAP
• 5) Data: OLTP systems usually deal only with the
current state of the data. For example, a record about an
employee who left three years ago may no longer be
available in the Human Resources system. The old
data may have been archived on some type of stable
storage medium and may not be accessible online. On
the other hand, OLAP systems need historical data
spanning several years, since trends are often essential in
decision making.
• 6) Kind of use: OLTP systems are used for read and
write operations, while OLAP systems usually
do not update the data.
OLTP vs OLAP
• 7) View: An OLTP system focuses primarily on the current data
within an enterprise or department, without reference to
historical data or data in other organizations. In contrast,
an OLAP system spans multiple versions of a database schema,
due to the evolutionary process of an organization. An OLAP
system also deals with information that originates from
different organizations, integrating information from many data
stores. Because of its huge volume, this data is stored on
multiple storage media.
• 8) Access Patterns: The access pattern of an OLTP system
consists primarily of short, atomic transactions. Such a system
requires concurrency control and recovery techniques. Access
to OLAP systems, however, is mostly read-only, because
these data warehouses store historical information.
OLTP vs OLAP
• The biggest difference between an OLTP and an
OLAP system is the amount of data analyzed in
a single transaction. Whereas an OLTP system handles
many concurrent users and queries
touching only a single record or a limited
collection of records at a time, an OLAP
system must be able to operate on
millions of records to answer a single query.
OLAP Operations
• In the multidimensional model, records are
organized into various dimensions, and each
dimension includes multiple levels of abstraction
described by concept hierarchies.
• This organization gives users the flexibility
to view data from various perspectives.
• A number of OLAP data cube operations exist to
materialize these different views, allowing
interactive querying and analysis of the data at
hand. Hence, OLAP provides a user-friendly
environment for interactive data analysis.
OLAP Operations
• OLAP is based on a multidimensional data model and allows the user to
query multi-dimensional data (e.g., Delhi -> 2018 -> Sales data).
OLAP databases are divided into one or more cubes; these
cubes are known as hyper-cubes.
OLAP Operations
• Roll-Up (Drill-Up):
• It performs aggregation on a data cube, either by climbing up a concept
hierarchy for a dimension or by dimension reduction.
• Roll-up is like zooming out on the data cube.
• It moves from specialization to generalization.
• The figure shows the result of a roll-up operation performed on the
dimension location. The hierarchy for location is defined as the order
street < city < province or state < country.
• The roll-up operation aggregates the data by ascending the location
hierarchy from the level of city to the level of country.
• When a roll-up is performed by dimension reduction, one or more
dimensions are removed from the cube. For example, consider a sales
data cube having two dimensions, location and time. Roll-up may be
performed by removing the time dimension, resulting in an
aggregation of the total sales by location, rather than by location and
by time.
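A roll-up along the location hierarchy amounts to grouping and aggregating at the higher level. A sketch with SQLite, using invented city-level sales figures:

```python
import sqlite3

# Roll-up sketch: aggregate city-level sales up to the country level.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (city TEXT, country TEXT, units INTEGER)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("Toronto", "Canada", 5),
    ("Vancouver", "Canada", 3),
    ("Chicago", "USA", 7),
])

# Climbing the hierarchy city -> country is a GROUP BY at the coarser level.
rollup = con.execute(
    "SELECT country, SUM(units) FROM sales GROUP BY country ORDER BY country"
).fetchall()
print(rollup)  # [('Canada', 8), ('USA', 7)]
```

Dropping the `city` column from the grouping is exactly the "dimension reduction" form of roll-up described above.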
OLAP Operations
• Drill-Down (Roll-Down):
• It is the reverse of roll-up. Drill-down is like zooming in on
the data cube.
• It moves from generalization to specialization.
• It navigates from less detailed data to more detailed data. Drill-
down can be performed either by stepping down a concept
hierarchy for a dimension or by adding additional dimensions.
• The figure shows a drill-down operation performed on the dimension
time by stepping down a concept hierarchy defined as day <
month < quarter < year. Drill-down proceeds by descending the time
hierarchy from the level of quarter to the more detailed level of
month.
• Because a drill-down adds more detail to the given data, it can also
be performed by adding a new dimension to a cube. For example, a
drill-down on the central cube of the figure can occur by
introducing an additional dimension, such as customer group.
OLAP Operations
• Slice
• A slice is a subset of the cube corresponding to a
single value for one or more members of a
dimension.
• For example, a slice operation is executed when
the user wants a selection on one dimension
of a three-dimensional cube, resulting in a two-
dimensional subcube.
• Thus, the slice operation performs a selection on one
dimension of the given cube, resulting in a
subcube.
OLAP Operations
• Dice
• The dice operation defines a subcube by
performing a selection on two or more dimensions.
• Consider the following diagram, which shows a
dice operation.
• The dice operation on the cube is based on the
following selection criteria, which involve three
dimensions:
• (location = "Toronto" or "Vancouver")
• (time = "Q1" or "Q2")
• (item =" Mobile" or "Modem")
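Slice and dice both reduce to filtering a cube on dimension values; a sketch with SQLite, storing the cube as a flat table with invented measures and using the selection criteria above:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE cube (location TEXT, time TEXT, item TEXT, units INTEGER)")
con.executemany("INSERT INTO cube VALUES (?, ?, ?, ?)", [
    ("Toronto", "Q1", "Mobile", 5), ("Toronto", "Q3", "Mobile", 2),
    ("Vancouver", "Q2", "Modem", 4), ("Chicago", "Q1", "Phone", 9),
])

# Slice: select on ONE dimension (time = 'Q1'), yielding a 2-D subcube.
slice_q1 = con.execute(
    "SELECT location, item, units FROM cube WHERE time = 'Q1' ORDER BY location"
).fetchall()

# Dice: select on TWO OR MORE dimensions at once, yielding a smaller subcube.
dice = con.execute("""
    SELECT * FROM cube
    WHERE location IN ('Toronto', 'Vancouver')
      AND time IN ('Q1', 'Q2')
      AND item IN ('Mobile', 'Modem')
    ORDER BY location
""").fetchall()
print(slice_q1)  # [('Chicago', 'Phone', 9), ('Toronto', 'Mobile', 5)]
print(dice)      # [('Toronto', 'Q1', 'Mobile', 5), ('Vancouver', 'Q2', 'Modem', 4)]
```

The slice fixes one dimension and drops it from the result; the dice keeps all dimensions but restricts several of them simultaneously.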
OLAP Operations
• Pivot
• The pivot operation is also called rotation.
• Pivot is a visualization operation that rotates the data axes in
view to provide an alternative presentation of the data.
• It may involve swapping the rows and columns, or moving one of
the row dimensions into the column dimensions.
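A pivot can be sketched in plain Python: rows keyed by (location, quarter) are rotated so locations become rows and quarters become columns. The data values are invented for illustration:

```python
# Pivot (rotation) sketch over (location, quarter, units) tuples.
rows = [
    ("Toronto", "Q1", 5), ("Toronto", "Q2", 2),
    ("Vancouver", "Q1", 4), ("Vancouver", "Q2", 6),
]

# Move the quarter dimension from the rows into the columns.
quarters = sorted({q for _, q, _ in rows})
pivot = {}
for loc, q, units in rows:
    pivot.setdefault(loc, {})[q] = units

# Each location is now one row; each quarter is one column.
for loc in sorted(pivot):
    print(loc, [pivot[loc].get(q, 0) for q in quarters])
# Toronto [5, 2]
# Vancouver [4, 6]
```

No data changes during a pivot; only the axes of the presentation rotate.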