Data Warehousing Fundamentals
Syllabus
Introduction to Data Warehouse, Data warehouse architecture,
Data warehouse versus Data Marts, E-R Modeling versus
Dimensional Modeling, Information Package Diagram, Data
Warehouse Schemas; Star Schema, Snowflake Schema, Factless
Fact Table, Fact Constellation Schema. Update to the dimension
tables. Major steps in ETL process, OLTP versus OLAP, OLAP
operations: Slice, Dice, Rollup, Drilldown and Pivot.
Introduction to Data Warehouse
• A Data Warehouse is a relational database management
system (RDBMS) construct built to meet the requirements of
analytical and decision-support processing, rather than those
of transaction processing systems.
• It can be loosely described as any centralized data repository
which can be queried for business benefits.
• It is a database that stores information oriented to satisfy
decision-making requests.
• It is a group of decision support technologies aimed at
enabling the knowledge worker (executive, manager, and
analyst) to make better and faster decisions.
• Data warehousing thus supports architectures and tools that
help business executives systematically organize, understand,
and use their information to make strategic decisions.
What is a Data Warehouse?
• A Data Warehouse (DW) is a relational
database that is designed for query and
analysis rather than transaction processing.
• It includes historical data derived from
transaction data from single and multiple
sources.
• A Data Warehouse provides integrated,
enterprise-wide, historical data and focuses on
providing support for decision-makers for data
modeling and analysis.
What is a Data Warehouse?
• A Data Warehouse is a collection of data belonging to the entire organization, not
only to a particular group of users.
• It is not used for daily operations and transaction processing but used for
making decisions.
• A Data Warehouse can be viewed as a data system with the following
attributes:
• It is a database designed for investigative tasks, using data from various
applications.
• It supports a relatively small number of clients with relatively long
interactions.
• It includes current and historical data to provide a historical perspective of
information.
• Its usage is read-intensive.
• It contains a few large tables.
• "Data Warehouse is a subject-oriented, integrated, and time-variant store
of information in support of management's decisions."
Characteristics of Data Warehouse
• Subject-Oriented
• A data warehouse focuses on the modeling and
analysis of data for decision-makers.
• Therefore, data warehouses typically provide a
concise and straightforward view of a
particular subject, such as customer, product, or
sales, rather than of the organization's ongoing
operations.
• This is done by excluding data that are not useful
concerning the subject and including all data
needed by the users to understand the subject.
Characteristics of Data Warehouse
Characteristics of Data Warehouse
• Integrated
• A data warehouse integrates various
heterogeneous data sources like RDBMS, flat
files, and online transaction records.
• It requires performing data cleaning and
integration during data warehousing to ensure
consistency in naming conventions, attribute
types, etc., among the different data sources.
Characteristics of Data Warehouse
Characteristics of Data Warehouse
• Time-Variant
• Historical information is kept in a data
warehouse. For example, one can retrieve data
from 3 months, 6 months, 12 months, or even
further back.
• This contrasts with a transaction system,
where often only the most current data is kept.
Characteristics of Data Warehouse
• Non-Volatile
• The data warehouse is a physically separate data storage,
which is transformed from the source operational RDBMS.
• The operational updates of data do not occur in the data
warehouse, i.e., update, insert, and delete operations are not
performed.
• It usually requires only two procedures in data accessing:
Initial loading of data and access to data. Therefore, the DW
does not require transaction processing, recovery, and
concurrency capabilities, which allows for substantial
speedup of data retrieval.
• Non-volatile means that, once entered into the
warehouse, data should not change.
Need for Data Warehouse
• Data Warehouse is needed for the following reasons:
• 1) Business User: Business users require a data warehouse to view
summarized data from the past. Since these people are non-technical, the
data may be presented to them in an elementary form.
• 2) Store historical data: A data warehouse is required to store time-variant
data from the past, which can then be used for various purposes.
• 3) Make strategic decisions: Some strategies depend on the data in the data
warehouse, so the data warehouse contributes to strategic decision-making.
• 4) For data consistency and quality: By bringing data from different
sources to a common place, an organization can effectively enforce
uniformity and consistency in its data.
• 5) Fast response time: A data warehouse has to be ready for somewhat
unexpected loads and types of queries, which demands a significant degree
of flexibility and quick response times.
Data Warehouse Architecture
• The Source Data component shows various internal & external data
sources such as Production Data, Internal Data, Archived Data, External
Data.
• The extracted data coming from several different sources needs to be
changed, converted, and made ready in a format suitable for querying
and analysis. The Data Staging component serves as the next building
block and performs all these tasks.
• In the middle, we see the Data Storage component that manages the data
warehouse's data. This component not only stores and manages the data; it
also keeps track of the data using the metadata repository.
• The Information Delivery component, shown on the right, consists of all
the different ways of making the information from the data warehouse
available to the users.
• The management and control elements coordinate the services and
functions within the data warehouse. These components control the
data transformation and the data transfer into the data warehouse
storage.
Advantages of Star Schema
• Query Performance
• Because a star schema database has a limited number of tables
and clear join paths, queries run faster than they do
against OLTP systems.
• Small single-table queries, frequently of a dimension
table, are almost instantaneous. Large join queries that
involve multiple tables take only seconds or minutes
to run.
• In a star schema design, the dimensions are
connected only through the central fact table. When
two dimension tables are used in a query, only one
join path, intersecting the fact table, exists between
those two tables. This design feature enforces accurate
and consistent query results.
Advantages of Star Schema
• Easily Understood
• A star schema is simple to understand and
navigate, with dimensions joined only through
the fact table.
• These joins are more significant to the end-
user because they represent the fundamental
relationship between parts of the underlying
business.
• Users can also browse dimension-table
attributes before constructing a query.
Star Schema
• Example: Suppose a star schema is composed of a
fact table, SALES, and several dimension tables
connected to it for time, branch, item, and
geographic locations.
• The TIME table has columns for day, month,
quarter, and year. The ITEM table has columns
item_key, item_name, brand, type, and
supplier_type. The BRANCH table has columns
branch_key, branch_name, and branch_type.
The LOCATION table has columns of geographic
data, including street, city, state, and country.
Star Schema
Star Schema
• In this scenario, the SALES fact table contains only four
key columns referencing the dimension tables TIME, ITEM,
BRANCH, and LOCATION, instead of four columns for
time data, four columns for item data, three columns for
branch data, and four columns for location data.
Thus, the size of the fact table is significantly reduced.
When we need to change an item, we need only make a
single change in the dimension table, instead of making
many changes in the fact table. A minimal sketch of this
schema appears below.
• We can create even more complex star schemas by
normalizing a dimension table into several tables. The
normalized dimension table is called a Snowflake.
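To ground the example, here is a minimal sketch of the star schema above using Python's built-in sqlite3 module. The table and column names follow the example; the units_sold measure, the sample rows, and the aggregation query are illustrative assumptions, not part of the original text.

```python
import sqlite3

# Minimal star schema sketch: four dimension tables joined only through
# the central SALES fact table. (units_sold and the sample data are
# assumptions added for illustration.)
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE TIME     (time_key INTEGER PRIMARY KEY, day TEXT, month TEXT, quarter TEXT, year INTEGER);
CREATE TABLE ITEM     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT, supplier_type TEXT);
CREATE TABLE BRANCH   (branch_key INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
CREATE TABLE LOCATION (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT, state TEXT, country TEXT);

-- The fact table holds only foreign keys to the dimensions plus measures.
CREATE TABLE SALES (
    time_key     INTEGER REFERENCES TIME(time_key),
    item_key     INTEGER REFERENCES ITEM(item_key),
    branch_key   INTEGER REFERENCES BRANCH(branch_key),
    location_key INTEGER REFERENCES LOCATION(location_key),
    units_sold   INTEGER
);

INSERT INTO TIME VALUES (1, '01', 'Jan', 'Q1', 2024);
INSERT INTO ITEM VALUES (1, 'Pen', 'BrandA', 'stationery', 'wholesale');
INSERT INTO BRANCH VALUES (1, 'Central', 'retail');
INSERT INTO LOCATION VALUES (1, 'MG Road', 'Mumbai', 'MH', 'India');
INSERT INTO SALES VALUES (1, 1, 1, 1, 120);
""")

# Two dimensions meet through exactly one join path via the fact table.
query = """
SELECT i.brand, l.city, SUM(s.units_sold) AS total_units
FROM SALES s
JOIN ITEM i     ON s.item_key = i.item_key
JOIN LOCATION l ON s.location_key = l.location_key
GROUP BY i.brand, l.city;
"""
print(con.execute(query).fetchall())  # -> [('BrandA', 'Mumbai', 120)]
```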
What is Snowflake Schema?
• A snowflake schema is a variant of the star schema. "A schema is known as a
snowflake if one or more dimension tables do not connect directly to the fact
table but must join through other dimension tables."
• The snowflake schema is an expansion of the star schema where each point of the
star explodes into more points.
• It is called a snowflake schema because its diagram resembles a snowflake.
• Snowflaking is a method of normalizing the dimension tables in a star schema.
When we normalize all the dimension tables entirely, the resulting structure
resembles a snowflake with the fact table in the middle.
• Snowflaking is used to improve the performance of specific queries. The schema is
diagrammed with each fact surrounded by its associated dimensions, and those
dimensions related to further dimensions, branching out into a snowflake
pattern.
• The snowflake schema consists of one fact table linked to many
dimension tables, which can in turn be linked to other dimension tables through
many-to-one relationships. Tables in a snowflake schema are generally normalized to
third normal form. Each dimension table represents exactly one level in a hierarchy.
Snowflake Schema
• The following diagram shows a snowflake schema with two
dimensions, each having three levels. A snowflake schema
can have any number of dimensions, and each dimension can
have any number of levels.
Snowflake Schema
• Example: Unlike the star schema, the dimension tables in a snowflake
schema are normalized. For example, the item dimension table of the star
schema is normalized and split into two dimension tables, namely the item
and supplier tables (sketched below).
• The item dimension table now contains the attributes item_key,
item_name, type, brand, and supplier_key.
• The supplier_key is linked to the supplier dimension table, which
contains the attributes supplier_key and supplier_type.
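A minimal sketch of this normalization, again with sqlite3. The column types and the extra join are assumptions made for illustration.

```python
import sqlite3

# Snowflaked item dimension: supplier attributes move to their own
# table, reached from the fact table only through ITEM. (Sketch only.)
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE SUPPLIER (supplier_key INTEGER PRIMARY KEY, supplier_type TEXT);

-- ITEM no longer stores supplier_type; it references SUPPLIER instead.
CREATE TABLE ITEM (
    item_key     INTEGER PRIMARY KEY,
    item_name    TEXT,
    type         TEXT,
    brand        TEXT,
    supplier_key INTEGER REFERENCES SUPPLIER(supplier_key)
);

CREATE TABLE SALES (item_key INTEGER REFERENCES ITEM(item_key), units_sold INTEGER);
""")

# Reaching supplier_type now costs one extra join compared to the star schema.
con.execute("""
SELECT sup.supplier_type, SUM(s.units_sold)
FROM SALES s
JOIN ITEM i       ON s.item_key = i.item_key
JOIN SUPPLIER sup ON i.supplier_key = sup.supplier_key
GROUP BY sup.supplier_type;
""")
```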
Snowflake Schema
• Advantage of Snowflake Schema
• The primary advantage of the snowflake schema is an
improvement in query performance due to minimized disk
storage requirements and joins against smaller lookup tables.
• It provides greater scalability in the interrelationship between
dimension levels and components.
• No redundancy, so it is easier to maintain.
• Disadvantage of Snowflake Schema
• The primary disadvantage of the snowflake schema is the
additional maintenance effort required by the increasing
number of lookup tables.
• Queries are more complex and hence more difficult to
understand.
• More tables mean more joins, and therefore longer query
execution times.
Factless Fact Table
• A typical fact table contains facts or measures; a factless fact table contains none.
• Factless fact tables are used only to establish relationships between
elements of different dimensions.
Factless Fact Table
• There are two types of factless fact tables:
• 1. Event Tracking Tables:
• Use a factless fact table to track events of
interest to the organization.
• 2. Coverage Tables:
• These support negative analysis reports.
For example, to report that a store did not
sell a product for a certain period of time,
you need a factless fact table capturing all
possible combinations; you can then find
out what is missing (see the sketch below).
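Here is a small sketch of the coverage-table idea in sqlite3. The table names (PROMOTION_COVERAGE, SALES) and the sample keys are hypothetical; the point is the LEFT JOIN that surfaces covered combinations with no matching sales.

```python
import sqlite3

# Coverage table: a factless fact table recording every product a store
# carried in a period; comparing it with SALES reveals what did NOT sell.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE PROMOTION_COVERAGE (store_key INTEGER, product_key INTEGER, time_key INTEGER);
CREATE TABLE SALES (store_key INTEGER, product_key INTEGER, time_key INTEGER, units_sold INTEGER);

INSERT INTO PROMOTION_COVERAGE VALUES (1, 10, 202401), (1, 11, 202401);
INSERT INTO SALES VALUES (1, 10, 202401, 5);  -- product 11 never sold
""")

# Negative analysis: covered combinations that produced no sales rows.
missing = con.execute("""
SELECT c.store_key, c.product_key
FROM PROMOTION_COVERAGE c
LEFT JOIN SALES s
  ON  c.store_key   = s.store_key
  AND c.product_key = s.product_key
  AND c.time_key    = s.time_key
WHERE s.units_sold IS NULL;
""").fetchall()
print(missing)  # -> [(1, 11)]
```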
Fact Constellation Schema
• A fact constellation has two or more fact
tables sharing one or more dimension tables. It is
also called a Galaxy schema or a multi-fact star schema.
• A fact constellation schema describes the logical
structure of a data warehouse or data mart.
• A fact constellation schema can be designed with a
collection of de-normalized fact tables and shared,
conformed dimension tables.
Fact Constellation Schema
• This schema defines two fact tables, sales and shipping (sketched below).
• Sales are treated along four dimensions, namely, time,
item, branch, and location. The schema contains a fact
table for sales that includes keys to each of the four
dimensions, along with two measures: Rupee_sold and
units_sold.
• The shipping table has five dimensions, or keys: item_key,
time_key, shipper_key, from_location, and to_location,
and two measures: Rupee_cost and units_shipped.
• The primary disadvantage of the fact constellation schema
is that it is a more challenging design because many
variants for specific kinds of aggregation must be
considered and selected.
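A minimal sketch of this constellation in sqlite3. The two fact tables share the TIME, ITEM, and LOCATION dimensions; the table, key, and measure names follow the text, while the column types and DDL layout are assumptions.

```python
import sqlite3

# Fact constellation (galaxy) sketch: SALES and SHIPPING share dimensions.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE TIME     (time_key INTEGER PRIMARY KEY);
CREATE TABLE ITEM     (item_key INTEGER PRIMARY KEY);
CREATE TABLE BRANCH   (branch_key INTEGER PRIMARY KEY);
CREATE TABLE SHIPPER  (shipper_key INTEGER PRIMARY KEY);
CREATE TABLE LOCATION (location_key INTEGER PRIMARY KEY);

CREATE TABLE SALES (
    time_key     INTEGER REFERENCES TIME(time_key),
    item_key     INTEGER REFERENCES ITEM(item_key),
    branch_key   INTEGER REFERENCES BRANCH(branch_key),
    location_key INTEGER REFERENCES LOCATION(location_key),
    Rupee_sold   REAL,
    units_sold   INTEGER
);

-- SHIPPING shares TIME and ITEM with SALES and uses LOCATION twice.
CREATE TABLE SHIPPING (
    item_key      INTEGER REFERENCES ITEM(item_key),
    time_key      INTEGER REFERENCES TIME(time_key),
    shipper_key   INTEGER REFERENCES SHIPPER(shipper_key),
    from_location INTEGER REFERENCES LOCATION(location_key),
    to_location   INTEGER REFERENCES LOCATION(location_key),
    Rupee_cost    REAL,
    units_shipped INTEGER
);
""")
```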
ETL process
• The mechanism of extracting information from source
systems and bringing it into the data warehouse is
commonly called ETL, which stands for Extraction,
Transformation and Loading.
• The ETL process requires active inputs from various
stakeholders, including developers, analysts, testers, and
top executives, and is technically challenging.
• To maintain its value as a tool for decision-makers, a
data warehouse needs to change with the business.
• ETL is a recurring activity (daily, weekly, monthly) of a
data warehouse system and needs to be agile,
automated, and well documented.
How ETL Works?
• Extraction
• Extraction is the operation of extracting information
from a source system for further use in a data
warehouse environment. This is the first stage of the
ETL process.
• The extraction process is often one of the most time-
consuming tasks in ETL.
• The source systems might be complicated and poorly
documented, and thus determining which data needs
to be extracted can be difficult.
• The data has to be extracted periodically to supply
all changed data to the warehouse and keep it
up-to-date (a sketch follows below).
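A minimal sketch of periodic, incremental extraction in Python: each run pulls only rows changed since the previous run. The source table ORDERS, its updated_at column, and the high-water-mark scheme are illustrative assumptions, not a prescribed design.

```python
import sqlite3
from datetime import datetime, timezone

# Incremental extraction sketch: pull only rows modified since the last
# run. ORDERS and updated_at are hypothetical names for this example.
def extract_changed_rows(source: sqlite3.Connection, last_run: str):
    """Return rows modified since the previous extraction."""
    return source.execute(
        "SELECT * FROM ORDERS WHERE updated_at > ?", (last_run,)
    ).fetchall()

source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE ORDERS (id INTEGER, amount REAL, updated_at TEXT)")
source.execute("INSERT INTO ORDERS VALUES (1, 99.0, '2024-01-02T10:00:00')")

# The warehouse remembers the high-water mark between runs.
last_run = "2024-01-01T00:00:00"
changed = extract_changed_rows(source, last_run)
last_run = datetime.now(timezone.utc).isoformat()  # advance the mark
print(changed)
```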
How ETL Works?
• Cleansing
• The cleansing stage is crucial in a data warehouse system because
it is supposed to improve data quality. The primary data-cleansing
features found in ETL tools are rectification and homogenization: they
use specific dictionaries to rectify typing mistakes and to recognize
synonyms, as well as rule-based cleansing to enforce domain-specific
rules and define appropriate associations between values.
• The following examples show why data cleansing is essential (a
dictionary-based sketch follows the list):
• If an enterprise wishes to contact its users or its suppliers, a
complete, accurate and up-to-date list of contact addresses, email
addresses and telephone numbers must be available.
• If a client or supplier calls, the staff responding should be able to
find the person quickly in the enterprise database, but this requires
that the caller's name or his/her company name is listed in the database.
• If a user appears in the databases with two or more slightly different
names or different account numbers, it becomes difficult to update
the customer's information.
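A small sketch of the dictionary-based rectification and synonym recognition described above. The dictionary contents and the cleanse helper are illustrative assumptions.

```python
# Dictionary-based cleansing sketch: a correction/synonym dictionary
# rectifies typing mistakes and maps variants to one canonical form.
CANONICAL = {
    "u.s.a": "USA",
    "united states": "USA",
    "america": "USA",
    "mumbay": "Mumbai",  # typing mistake -> corrected form
}

def cleanse(value: str) -> str:
    """Return the canonical form of a value, if one is known."""
    key = value.strip().lower()
    return CANONICAL.get(key, value.strip())

assert cleanse(" U.S.A ") == "USA"
assert cleanse("Mumbay") == "Mumbai"
assert cleanse("Pune") == "Pune"  # unknown values pass through
```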
How ETL Works?
• Transformation
• Transformation is the core of the reconciliation phase. It converts
records from their operational source format into a particular data
warehouse format. If we implement a three-layer architecture, this
phase outputs our reconciled data layer.
• The following points must be rectified in this phase:
• Loose text may hide valuable information. For example, the name XYZ PVT Ltd
does not explicitly show that this is a private limited company.
• Different formats can be used for individual data items. For example, a date
can be saved as a string or as three integers.
• Following are the main transformation processes aimed at populating
the reconciled data layer:
• Conversion and normalization that operate on both storage formats
and units of measure to make data uniform.
• Matching that associates equivalent fields in different sources.
• Selection that reduces the number of source fields and records.
How ETL Works?
• In this step, a set of rules or functions is applied to the
extracted data to convert it into a single standard format.
• It may involve the following processes/tasks (a sketch follows the list):
• Filtering – loading only certain attributes into the data
warehouse.
• Cleaning – filling NULL values with some default
values, mapping U.S.A, United States, and America into USA,
etc.
• Joining – joining multiple attributes into one.
• Splitting – splitting a single attribute into multiple attributes.
• Sorting – sorting tuples on the basis of some attribute
(generally key-attribute).
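A sketch of these tasks applied to hypothetical source records. The record layout and field names are assumptions for illustration, not a fixed format.

```python
# Transformation sketch: filtering, cleaning, joining, splitting, and
# sorting applied to raw records before loading.
COUNTRY_MAP = {"U.S.A": "USA", "United States": "USA", "America": "USA"}

def transform(rec: dict) -> dict:
    first, last = rec["full_name"].split(" ", 1)  # splitting one attribute
    return {
        "first_name": first,
        "last_name": last,
        "display_name": f"{last}, {first}",                          # joining
        "country": COUNTRY_MAP.get(rec["country"], rec["country"]),  # cleaning
        "age": rec["age"] if rec["age"] is not None else -1,         # NULL default
    }  # filtering: internal_id is deliberately not loaded

raws = [
    {"full_name": "Asha Rao", "country": "U.S.A", "age": None, "internal_id": 42},
    {"full_name": "Ben Lee", "country": "United States", "age": 31, "internal_id": 43},
]
rows = sorted((transform(r) for r in raws), key=lambda r: r["last_name"])  # sorting
print(rows)
```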
How ETL Works?
• Cleansing and Transformation processes are
often closely linked in ETL tools.
How ETL Works?
• Loading
• The load is the process of writing the data into the target
database. During the load step, it is necessary to ensure that the
load is performed correctly and with as few resources as
possible.
• Loading can be carried out in two ways (sketched below):
• Refresh: Data warehouse data is completely rewritten, i.e.,
older data is replaced. Refresh is usually used in
combination with static extraction to populate a data warehouse
initially.
• Update: Only the changes applied to the source information are
added to the data warehouse. An update is typically carried out
without deleting or modifying preexisting data. This method is
used in combination with incremental extraction to update data
warehouses regularly.
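A minimal sketch of the two loading modes with sqlite3. The SALES table and its columns are hypothetical; the point is the contrast between rewriting everything and appending only changes.

```python
import sqlite3

# Loading sketch: refresh rewrites the table; update only appends changes.
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE SALES (sale_id INTEGER, amount REAL)")

def refresh(rows):
    """Refresh: warehouse data is completely rewritten (older data replaced)."""
    dw.execute("DELETE FROM SALES")
    dw.executemany("INSERT INTO SALES VALUES (?, ?)", rows)

def update(rows):
    """Update: only changes from the source are added; preexisting rows
    are neither deleted nor modified."""
    dw.executemany("INSERT INTO SALES VALUES (?, ?)", rows)

refresh([(1, 100.0), (2, 250.0)])  # initial load with static extraction
update([(3, 80.0)])                # regular load with incremental extraction
print(dw.execute("SELECT COUNT(*) FROM SALES").fetchone())  # -> (3,)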
How ETL Works?
• The ETL process can also use pipelining: as soon as some data
is extracted, it can be transformed, and during that period new data can
be extracted. Likewise, while transformed data is being loaded into the data
warehouse, already-extracted data can be transformed. A sketch of this
pipelining appears below:
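A minimal sketch of pipelined ETL using Python threads and queues: the three stages run concurrently and overlap in time. The data items and stage bodies are illustrative assumptions.

```python
import queue
import threading

# Pipelined ETL sketch: extract, transform, and load run as concurrent
# stages connected by queues, so new data is extracted while earlier
# data is still being transformed or loaded.
extracted = queue.Queue()
transformed = queue.Queue()
DONE = object()  # sentinel marking the end of the stream

def extract():
    for record in ({"amount": "10"}, {"amount": "20"}):  # stand-in source rows
        extracted.put(record)
    extracted.put(DONE)

def transform():
    while (rec := extracted.get()) is not DONE:
        transformed.put({"amount": int(rec["amount"])})  # convert to standard format
    transformed.put(DONE)

def load():
    while (rec := transformed.get()) is not DONE:
        print("loaded", rec)  # stand-in for a warehouse write

stages = [threading.Thread(target=f) for f in (extract, transform, load)]
for t in stages:
    t.start()
for t in stages:
    t.join()
```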
ETL Process
• ETL Tools: Commonly used ETL tools include Hevo,
Sybase, Oracle Warehouse Builder, CloverETL, and
MarkLogic.
• Data Warehouses: Most commonly used Data
Warehouses are Snowflake, Redshift, BigQuery, and
Firebolt.
• The ETL process is essential in data warehousing:
it helps ensure that the data in the data warehouse
is accurate, complete, and up-to-date. However, it also
comes with its own challenges and limitations, and
organizations need to weigh the costs and benefits
carefully before implementing it.
What is OLAP (Online Analytical Processing)?
• OLAP stands for On-Line Analytical Processing.
• OLAP is a category of software technology that enables
analysts, managers, and executives to gain insight into
data through fast, consistent, interactive access to a wide
variety of possible views of information that has been transformed
from raw data to reflect the real dimensionality of the
enterprise as understood by the user.
• OLAP implements multidimensional analysis of business
information and supports complex calculations,
trend analysis, and sophisticated data modeling (a small sketch follows).
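As a small taste of the multidimensional view OLAP provides, here is a pandas pivot table standing in for an OLAP tool; the sales data is an illustrative assumption.

```python
import pandas as pd

# Multidimensional view sketch: look at the units measure along the
# year and region dimensions. (pandas here is a stand-in for OLAP.)
sales = pd.DataFrame({
    "year":   [2023, 2023, 2024, 2024],
    "region": ["East", "West", "East", "West"],
    "units":  [120, 90, 150, 110],
})

cube = pd.pivot_table(sales, values="units", index="year",
                      columns="region", aggfunc="sum")
print(cube)
```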