DWMM Notes
Data Warehouse Introduction
Ques5: What do you mean when we say that a Data Warehouse is integrated?
It means that data from different sources like RDBMS, flat files and online transaction
records is combined and stored in a central repository. The integrated data includes both
current and historical information. It facilitates unified access to data for various purposes,
such as reporting and analysis. Integration ensures consistency and coherence of data across
the warehouse. Users can access integrated data seamlessly, regardless of its source, enabling
comprehensive analysis and decision-making. For example, if Account is a subject, data from
various sources (savings account, current account and loan account) can all be stored under
the subject Account.
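The Account example can be sketched in Python. This is only an illustration: the three source feeds, their field names and the sample values are hypothetical, chosen to show differently named source fields being mapped onto one common Account layout.

```python
# Hypothetical records from three source systems, each with its own field names.
savings_rows = [{"acc_no": "S-101", "cust": "Asha", "bal": 5200.0}]
current_rows = [{"account_id": "C-202", "customer_name": "Ravi", "balance": 18000.0}]
loan_rows = [{"loan_id": "L-303", "borrower": "Asha", "outstanding": 75000.0}]


def to_account(row, id_field, name_field, amount_field, account_type):
    """Map one source row onto the common Account layout."""
    return {
        "account_id": row[id_field],
        "customer": row[name_field],
        "amount": row[amount_field],
        "account_type": account_type,
    }


# The integrated "Account" subject holds rows from all three sources in one format.
account_subject = (
    [to_account(r, "acc_no", "cust", "bal", "savings") for r in savings_rows]
    + [to_account(r, "account_id", "customer_name", "balance", "current") for r in current_rows]
    + [to_account(r, "loan_id", "borrower", "outstanding", "loan") for r in loan_rows]
)
```

After this step every row has the same four fields, so a report can treat savings, current and loan accounts uniformly.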
Ques5: What do you mean when we say that a Data Warehouse is time-variant?
It means that the Data Warehouse stores historical data alongside current information for
comparison and trend analysis. Users can access data from various time intervals, such as
daily, weekly, monthly, or annually. Time-variant data allows for tracking changes over time
and making informed decisions based on historical patterns.
Ques5: What do you mean when we say that a Data Warehouse is non-volatile?
It means that once the data has entered the Data Warehouse, it should not change. It should
not get updated with every transaction or operation like in operational databases. Non-
volatility ensures that historical data remains accessible for analysis and decision-making.
Users can rely on the consistency and stability of the data stored in the warehouse. Non-
volatility distinguishes data warehouses from operational databases, which are subject to
frequent updates and changes.
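The time-variant and non-volatile properties can be illustrated together with a small append-only store. This is a minimal sketch, not a real warehouse design: the class name SnapshotStore, its methods and the sample balances are all assumptions made for the example.

```python
from datetime import date


class SnapshotStore:
    """Append-only store: rows are added with a load date and never updated."""

    def __init__(self):
        self._rows = []

    def load(self, as_of, account_id, balance):
        # New facts are appended; earlier snapshots are left untouched (non-volatile).
        self._rows.append({"as_of": as_of, "account_id": account_id, "balance": balance})

    def history(self, account_id):
        """Return all snapshots for one account, oldest first (time-variant)."""
        rows = [r for r in self._rows if r["account_id"] == account_id]
        return sorted(rows, key=lambda r: r["as_of"])


store = SnapshotStore()
store.load(date(2024, 1, 31), "S-101", 5200.0)
store.load(date(2024, 2, 29), "S-101", 4800.0)
balances = [r["balance"] for r in store.history("S-101")]
```

An operational database would overwrite the January balance with the February one; here both month-end values remain available for trend analysis.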
Ques6: What is an Operational Database?
The Operational Database is the source of information for the data warehouse. It includes
detailed information which is used to run the day-to-day operations of the business. The data
in an Operational Database is frequently updated and reflects the current value of the last
transaction. Operational database management systems, also called OLTP (Online
Transaction Processing) databases, are used to manage dynamic data in real time. Data
Warehouse Systems are called Online Analytical Processing (OLAP) Systems.
Data Warehouse and the OLTP database are both relational databases. However, the goals of
both these databases are different.
Ques9: Explain the architecture of the data warehouse with an appropriate diagram.
https://www.youtube.com/watch?v=Eh_8ZATRauQ Copy diagram half
Ques10: What are the components of Source Data in a Data Warehouse?
The source data comprises Production Data, Internal Data, Archived Data and External
Data.
1) Production Data: This is the day-to-day data which comes from different operational
systems in a company. Based on the data requirements in the data warehouse, parts of the
data are chosen.
2) Internal Data: Every organization has its private files like spreadsheets, reports, and
customer records. These are called internal data, some of which can be useful in a data
warehouse.
3) Archived Data: It is the historical data.
4) External Data: It is the data relevant to the industry, provided by external sources.
The source data from various operational systems and external sources is extracted and fed
into the data staging area of the data warehouse where it is changed, converted and made
ready in a format which can be used for querying and analysis.
Ques12: Explain the concept of metadata in a data warehouse.
Metadata is data about data, which provides information about the other data stored in the
data warehouse. It helps users understand the data content, its origin, and how it can be
utilized for analytical purposes. Metadata includes attributes like data types, constraints,
integrity rules, and storage details, offering comprehensive insights into the data. Proper
management of metadata ensures the accuracy, consistency, and reliability of data stored in
the data warehouse.
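As a rough illustration of what a metadata entry might record, the sketch below keeps one record per warehouse column. The ColumnMetadata fields, the table and column names and the source system name are hypothetical; real metadata repositories store considerably more.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnMetadata:
    table: str
    column: str
    data_type: str
    nullable: bool        # a simple constraint
    source_system: str    # where the data originated
    description: str


# A tiny hypothetical metadata catalog.
catalog = [
    ColumnMetadata("sales_fact", "amount", "DECIMAL(12,2)", False,
                   "billing_db", "Sale value in dollars"),
    ColumnMetadata("sales_fact", "date_key", "INTEGER", False,
                   "billing_db", "Foreign key to the time dimension"),
]


def describe(table):
    """List the documented columns of one warehouse table."""
    return [m for m in catalog if m.table == table]
```

Calling describe("sales_fact") returns both entries, which tells a user the type, origin and meaning of each column without inspecting the data itself.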
Ques14: What are the different steps for creating a data warehouse?
The different steps for creating a data warehouse are:
1. Determine the Business Objectives: Understand the organization's goals and objectives
for building the data warehouse, including the specific business questions it aims to answer.
2. Collect and Analyze Information: Gather information about data sources, formats, and
business processes to assess the data needs and requirements.
3. Identify Core Business Processes: Identify the key business processes and data elements
required to support the organization's objectives.
4. Create a Source Data Model: Develop a model that defines the structure and
relationships of the source data to be used in the data warehouse.
5. Conceptualize and Select the Platform: Decide on the technology platform and
architecture for the data warehouse, considering factors like scalability, performance, and
budget.
6. Develop a Project Roadmap: Outline a roadmap with timelines, milestones, and resource
requirements for building the data warehouse.
7. Design and Implement ETL Processes: Develop Extract, Transform, and Load (ETL)
processes to extract data from source systems, transform it into the desired format, and load it
into the data warehouse.
8. Implement Data Governance and Security: Establish data governance policies and
security measures to ensure the integrity, confidentiality, and availability of data.
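The ETL step (step 7) can be sketched with Python's standard csv and sqlite3 modules. This is a minimal illustration under stated assumptions: the CSV text, the sales table and the cleaning rule (drop rows with no amount) are made up for the example, and an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source extract; a real source would be a file or an operational database.
SOURCE_CSV = """order_id,order_date,amount
1,2024-01-05, 120.50
2,2024-01-06,80
3,2024-01-06,
"""


def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with no amount and convert the field types."""
    clean = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # incomplete record, skipped in this sketch
        clean.append((int(row["order_id"]), row["order_date"], float(amount)))
    return clean


def load(conn, rows):
    """Load: write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(order_id INTEGER PRIMARY KEY, order_date TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(SOURCE_CSV)))
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
```

Keeping extract, transform and load as separate functions mirrors the three phases and lets each be tested on its own.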
Ques17: Why are data marts created?
Data Marts are created for various reasons:
1) Data marts provide easier access to specific subsets of data compared to dealing with the
entire data warehouse. This facilitates quicker retrieval of relevant information for decision-
making.
2) They allow organizations to focus on particular business areas or departments, enabling
more targeted analysis and reporting.
3) Data marts can be tailored to meet the specific needs of different user groups or business
units within an organization. This customization ensures that users have access to the most
relevant and useful data for their purposes.
4) By concentrating on specific subject areas, data marts simplify data maintenance tasks
compared to managing a large, comprehensive data warehouse.
5) Data marts can be scaled independently, allowing organizations to expand their analytical
capabilities incrementally as needed.
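The idea of a data mart as a focused subset can be illustrated with a database view. This is only a sketch: the warehouse_sales table, the marketing_mart view and the sample rows are hypothetical, and a real data mart is often a separate physical store rather than a view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE warehouse_sales (department TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO warehouse_sales VALUES (?, ?, ?)",
    [("marketing", "north", 100.0), ("finance", "north", 250.0), ("marketing", "south", 40.0)],
)

# The "mart" exposes only the marketing subset of the warehouse table.
conn.execute(
    "CREATE VIEW marketing_mart AS "
    "SELECT region, amount FROM warehouse_sales WHERE department = 'marketing'"
)

mart_rows = conn.execute(
    "SELECT region, amount FROM marketing_mart ORDER BY region"
).fetchall()
```

Marketing users query marketing_mart and never see the finance rows, which is the easier, more targeted access described in points 1 to 3.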
2) Cost-Effectiveness: Data marts are cost-effective because they store a particular subset of
data, which lowers data storage costs compared to maintaining a comprehensive data
warehouse.
3) Ease of Implementation: Implementing a data mart requires less time compared to a data
warehouse, as data marts are designed for specific purposes and subsets of data.
4) Improved Access: Data marts enable easier and faster access to data for specific user
groups or departments, enhancing decision-making and operational efficiency.
Ques21: Explain the different steps in implementing a Data Mart.
1) Designing: This is the first step in implementing a data mart. It covers all the functions
from initiating the request for a data mart to gathering data about the requirements and
developing the logical and physical design of the data mart. It involves the following tasks:
a) Gathering the business and technical requirements
b) Identifying the data sources
c) Selecting the appropriate subset of data
d) Designing the logical and physical architecture of the data mart
2) Constructing: This step involves creating the physical database and the logical structures
associated with the data mart to provide fast and efficient access to the data.
It involves the following tasks:
a) Creating the physical database and logical structures such as tablespaces associated with
the data mart.
b) Creating the schema objects such as tables and indexes described in the design step.
c) Determining how to best set up the tables and access structures.
3) Populating: This step includes all of the tasks related to getting data from the source,
cleaning it up, modifying it to the right format and level of detail and moving it into the data
mart. It involves the following tasks:
a) Mapping data sources to target data structures
b) Extracting data
c) Cleansing and transforming the information
d) Loading data into the data mart
e) Creating and storing metadata
4) Accessing: This step involves putting the data to use: querying the data, analyzing it,
creating reports, charts and graphs, and publishing them. It involves the following tasks:
a) Setting up an intermediate layer for the front-end tool to use.
b) Setting up and managing database structures such as summarized tables.
5) Managing: This step involves managing the data mart throughout its lifetime. It involves
the following tasks:
a) Providing secure access to the data.
b) Managing the growth of data.
c) Optimizing the system for better performance.
d) Ensuring the availability of data even during system failures.
Dimensional Modelling
Ques1: Within a data warehouse, how is the data organised and represented?
Data in a data warehouse is usually multidimensional, having measure attributes (facts) and
dimension attributes (dimensions). Multidimensional data refers to data being organized and
represented in multiple dimensions. Multidimensional data is typically represented as a data
cube. This cube consists of multiple dimensions such as time, geography, product, and
customer. A data cube enables data to be modelled and viewed in multiple dimensions. This
structure allows for more complex analysis and querying compared to traditional relational
databases.
Ques2: What is a fact, fact table, dimension, dimension table?
Fact: A fact represents the numerical data or metrics that are of interest for analysis. These
are typically quantitative measures, such as sales revenue, quantity sold, or profit margin.
Facts provide the core data that analysts analyze and report on.
Fact Table: A fact table is the central table in a dimensional model. It contains the facts or
measures along with foreign keys referencing the dimension tables. Fact tables store the
quantitative information for analysis and are surrounded by dimension tables. Examples of
fact tables include sales transactions, order details, or inventory levels.
Dimension: Dimensions are the descriptive attributes that provide context to the facts. They
represent the various ways the data can be analyzed or categorized. Dimensions are typically
hierarchical and include attributes such as time, geography, product, or customer.
Dimension Table: Dimension tables contain the attributes of dimensions. Each dimension
table represents a specific dimension and includes descriptive information about that
dimension. For example, a time dimension table may contain attributes like year, month,
quarter, and day. Dimension tables provide the context and background information necessary
for analyzing the facts stored in the fact table.
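The four terms can be seen together in a small example built with Python's sqlite3 module. The table names (fact_sales, dim_product, dim_time), their columns and the sample rows are hypothetical; the point is only that the fact table holds measures plus foreign keys, while the dimension tables hold descriptive attributes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive attributes that give context.
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_time (time_key INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);

-- Fact table: numeric measures plus foreign keys to the dimensions.
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    time_key INTEGER REFERENCES dim_time(time_key),
    units_sold INTEGER,
    revenue REAL
);

INSERT INTO dim_product VALUES (1, 'Laptop', 'Computers'), (2, 'Phone', 'Mobiles');
INSERT INTO dim_time VALUES (10, 2024, 'Q1'), (11, 2024, 'Q2');
INSERT INTO fact_sales VALUES (1, 10, 3, 3000.0), (2, 10, 5, 2500.0), (1, 11, 2, 2000.0);
""")

# Analyze a fact (revenue) by a dimension attribute (category).
revenue_by_category = conn.execute("""
    SELECT p.category, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
```

The query joins the fact table to one dimension table and aggregates the measure, which is the typical access pattern for a dimensional model.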
sum or average.
5) Contains fewer attributes than a dimension table. | 5) Contains more attributes than a fact table.
6) Contains more records than a dimension table. | 6) Contains fewer records than a fact table.
7) The fact table forms a vertical table. | 7) The dimension table forms a horizontal table.
8) In a schema, the number of fact tables is less than the number of dimension tables. | 8) In a schema, the number of dimension tables is more than the number of fact tables.
9) Used for analysis, reporting, and aggregations in data analysis and business intelligence. | 9) Provides context and background information for analyzing the data in the fact table.
10) A pure fact table is a collection of foreign keys. | 10) A pure dimension table is a collection of primary keys.
Reference: https://www.javatpoint.com/data-warehouse-what-is-data-cube
3D view of Sales Data
Suppose we would like to view the sales data in dollars (in thousands) according
to time, item and location for the cities of Chicago, New York, Toronto, and
Vancouver. Here, time, item and location are the three dimensions. The 3-D data
of the table are represented as a series of 2-D tables.
Conceptually, we can represent the same data in the form of 3-D data cubes, as shown in fig:
In data warehousing, the data cubes are n-dimensional. The cuboid which holds the lowest
level of summarization is called a base cuboid. The topmost 0-D cuboid, which holds the
highest level of summarization, is known as the apex cuboid.
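The cuboids of a small cube can be computed directly in Python. This is a sketch with made-up sales figures; it assumes each dimension has a single level (no concept hierarchies), in which case n dimensions give 2^n cuboids.

```python
from collections import defaultdict
from itertools import combinations

dimensions = ("time", "item", "location")

# Hypothetical sales facts: (time, item, location, dollars_in_thousands)
facts = [
    ("Q1", "phone", "Chicago", 10),
    ("Q1", "phone", "Toronto", 7),
    ("Q2", "laptop", "Chicago", 12),
]


def cuboid(group_by):
    """Aggregate sales over the chosen subset of dimensions."""
    idx = [dimensions.index(d) for d in group_by]
    totals = defaultdict(int)
    for row in facts:
        totals[tuple(row[i] for i in idx)] += row[3]
    return dict(totals)


# Every subset of the dimensions defines one cuboid: 2**3 = 8 in total.
lattice = {
    group_by: cuboid(group_by)
    for n in range(len(dimensions) + 1)
    for group_by in combinations(dimensions, n)
}

base_cuboid = lattice[dimensions]  # 3-D cuboid, lowest level of summarization
apex_cuboid = lattice[()]          # 0-D cuboid, the grand total
```

Here the base cuboid keeps one cell per (time, item, location) combination, while the apex cuboid collapses everything into a single total.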
Ques5: What is a data warehouse schema?
A data warehouse schema is a logical design that represents the structure of a data warehouse.
It defines how data is organized, stored, and accessed within the data warehouse
environment.
Ques6: Explain the different types of data warehouse schema. Also, list their advantages
and disadvantages.
a) Star Schema: It represents a multidimensional data model. It is known as star
schema because the entity-relationship diagram of this schema resembles a star with the fact
table at the center and dimension tables surrounding it.
Advantages
1) Star schema simplifies the queries by reducing the number of joins required, resulting in
faster query performance.
2) It enables fast aggregations and calculations, such as total items sold or revenue, making it
suitable for analytical queries.
3) Due to its simplicity, star schema is easier to maintain and modify.
Disadvantages
1) Star schema can be rigid, making it challenging to accommodate changes or additions to
the data model.
2) It may lead to data quality issues, particularly when handling denormalized data,
potentially impacting data integrity.
3) Managing a star schema with numerous dimension tables can introduce complexity,
affecting maintenance and performance.
b) Snowflake Schema: The snowflake schema is an expansion of the star schema where each
point of the star explodes into more points. It is called snowflake schema because the diagram
of snowflake schema resembles a snowflake.
Snowflaking is a method of normalizing the dimension tables in star schemas.
When we normalize all the dimension tables entirely, the resultant structure resembles a
snowflake with the fact table in the middle. The snowflake schema consists of one fact table
which is linked to many dimension tables, which can be linked to other dimension tables
through a many-to-one relationship. Tables in a snowflake schema are generally normalized
to the third normal form. Each dimension table represents exactly one level in a hierarchy.
Advantages
1) Snowflake schema reduces data redundancy by normalizing dimension tables, leading to
efficient storage.
2) Reduced redundancy offers protection from inconsistencies, which supports accurate
data analysis.
3) The minimized disk storage requirements and smaller lookup tables enhance query
performance.
4) Snowflake schema provides structured data enhancing data integrity and organization.
Disadvantages
1) Snowflake schema needs more maintenance and is complex to manage because it involves
additional lookup tables.
2) Snowflake schema may lead to complicated queries because of the large number of joins
between tables. This can slow down performance.
3) Maintaining Snowflake schema can be expensive.
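Snowflaking a single dimension can be sketched as follows. The product dimension, its rows and the helper name snowflake are hypothetical; the sketch moves the repeated category attribute into its own lookup table linked by a key.

```python
# Denormalized star-schema dimension: the category name repeats on every product row.
star_dim_product = [
    {"product_key": 1, "name": "Laptop", "category": "Computers"},
    {"product_key": 2, "name": "Desktop", "category": "Computers"},
    {"product_key": 3, "name": "Phone", "category": "Mobiles"},
]


def snowflake(rows):
    """Split the category attribute into its own lookup table."""
    category_keys = {}
    products = []
    for row in rows:
        # Assign the next key the first time a category is seen; reuse it afterwards.
        key = category_keys.setdefault(row["category"], len(category_keys) + 1)
        products.append(
            {"product_key": row["product_key"], "name": row["name"], "category_key": key}
        )
    categories = [{"category_key": k, "category": c} for c, k in category_keys.items()]
    return products, categories


dim_product, dim_category = snowflake(star_dim_product)
```

The text "Computers" is now stored once instead of twice, at the cost of one extra join whenever a query needs the category name.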
c) Fact Constellation Schema: In this schema, multiple fact tables share the same
dimension tables. It is also called the galaxy schema.
This schema describes the logical structure of a data warehouse.
Advantages
1) Fact constellation schema allows flexible data modeling.
2) It facilitates rich analysis by providing multiple paths for analyzing data.
Disadvantages
1) Designing, implementing, and maintaining a fact constellation schema can be more
challenging compared to simpler schemas like star schema due to its intricate structure.
2) It may lead to data redundancy because of repeated dimension tables, which can impact
storage efficiency.
3) Multiple fact tables make galaxy schema complex.
Ques7: What is the difference between star schema and snowflake schema?
Star Schema | Snowflake Schema
1) Resembles a star, with the fact table at the center and dimension tables around it. | 1) Looks like a snowflake, with the fact table at the center, connected to dimension tables that further branch out into sub-dimensions.
2) It follows a top-down design approach. | 2) It is more complex to design in comparison to star schema.
3) Uses more space due to denormalized dimension tables. | 3) Generally uses less space due to normalized structure.
4) More data dependency and redundancy. | 4) Less data dependency and redundancy.
5) Complicated joins are not required. | 5) Complicated joins are required.
Ques8: What is the difference between snowflake schema and fact constellation schema?
Snowflake Schema | Fact Constellation Schema
1) It contains a large central fact table, dimension tables and sub-dimension tables. | 1) In this schema, multiple fact tables share the same dimension tables.
2) It consists of one star schema at a time. | 2) It consists of more than one star schema at a time.
3) It is a normalized form of star schema. | 3) It is a normalized form of snowflake schema and star schema.
4) It is easy to operate because it has fewer joins between the tables. | 4) It is difficult to operate because of multiple joins between the tables.
5) In snowflake schema, a simple query can be used to access the data from the database. | 5) In fact constellation schema, a complex query has to be used to access the data from the database.
Ques9: What is the difference between star schema and fact constellation schema?
Star Schema | Fact Constellation Schema
1) Each dimension is represented with only one dimension table. | 1) In this schema, multiple fact tables share the same dimension tables.
2) It is easy to maintain the tables. | 2) It is difficult to maintain the tables.
3) It does not use normalization. | 3) It is a normalized form of snowflake schema and star schema.
4) In star schema, a simple query can be used to access the data from the database. | 4) In fact constellation schema, a complex query has to be used to access the data from the database.
5) It is easy to operate because it has fewer joins between the tables. | 5) It is difficult to operate because of multiple joins between the tables.
4) Alternate Key: All candidate keys other than the primary key are called alternate
keys. An alternate key is also known as a secondary key.
5) Foreign Key: A foreign key is an attribute (or set of attributes) in one table that
refers to the primary key of another table. It enforces the referential integrity
constraint, which ensures that a value appearing in one relation for a given set
of attributes also appears for a certain set of attributes in another relation.
6) Composite Key: Sometimes, a table might not have a single attribute that uniquely
identifies all the records of a table. To uniquely identify rows of a table, a
combination of two or more attributes can be used. Such a combination is called a
composite key, and it acts as the primary key when no single attribute qualifies.
A poorly chosen combination of attributes can still give duplicate values, so we need
to find the optimal set of attributes that can uniquely identify rows in a table.
7) Surrogate Key: It is also called a synthetic primary key. It is a sequential number
which is automatically generated by the database. This number is made available to
the user or the application, but its value cannot be modified by them. If we do not
have a natural primary key in the table, then we need to artificially create a primary
key (surrogate key) to uniquely fetch a record from the table. The surrogate key is
also called a factless key: it is added just for ease of identifying unique rows, and
it contains no fact or information that is otherwise relevant to the table.
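Several of these keys can be seen in one small SQLite example. The customer and enrollment tables and their columns are hypothetical, and the sketch relies on SQLite-specific behaviour: AUTOINCREMENT generates the surrogate key, and foreign keys are enforced only after PRAGMA foreign_keys = ON.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE customer (
    customer_sk INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
    email TEXT UNIQUE                                -- alternate key
);
CREATE TABLE enrollment (
    customer_sk INTEGER REFERENCES customer(customer_sk),  -- foreign key
    course_code TEXT,
    PRIMARY KEY (customer_sk, course_code)                 -- composite key
);
""")

# The database generates the surrogate key value; the application only reads it.
cur = conn.execute("INSERT INTO customer (email) VALUES ('asha@example.com')")
first_sk = cur.lastrowid
conn.execute("INSERT INTO enrollment VALUES (?, 'DW101')", (first_sk,))

# Referential integrity: a row pointing at a non-existent customer is rejected.
try:
    conn.execute("INSERT INTO enrollment VALUES (999, 'DW101')")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True
```

The composite primary key on enrollment allows the same customer to enrol in many courses, but not in the same course twice.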
Ques10: What is the difference between multidimensional data model and relational
data model?
Multidimensional data model | Relational data model
1) Here, the data is organised in a cube-like structure. | 1) Here, the data is organised in tables with rows and columns.
2) It is suitable for analyzing large amounts of data efficiently. | 2) It is ideal for managing structured data.
3) Here, the data is in denormalized form. | 3) Here, the data is in normalized form.
4) It is mainly used for OLAP. | 4) It is mainly used for OLTP.
5) MDX language is used for querying multidimensional databases. | 5) SQL is used for querying relational databases.
6) It is designed to handle data with multiple perspectives and dimensions, allowing quick analysis from different angles. | 6) It requires defining relationships between tables using keys and foreign keys to ensure data accuracy and integrity.
4) It produces a structured and organized view of data through reports, dashboards, and
analytics tools for effective decision-making.
5) A data warehouse allows users to access critical data from a number of sources in a
single place. In this way, it saves users the time of retrieving data from multiple sources.
Ques13: What are the different views of data warehouse?
The different views of a data warehouse are:
1) Top-down view: This view allows the selection of relevant information necessary for
the data warehouse.
2) Data source view: This view presents the information being captured, stored and
managed by the operational systems.
3) Data warehouse view: This view includes the fact tables and dimension tables. It
represents not only the information stored within the data warehouse but also
information about the source, date and time of origin of the data, which provides
historical context.
4) Business query view: This view gives the perspective of data in the data warehouse
from the viewpoint of the end user.
Ques15: What is HOLAP?
Hybrid OLAP (HOLAP) is a data processing approach that combines the benefits of
both relational OLAP (ROLAP) and multidimensional OLAP (MOLAP) systems. In
HOLAP, data is stored using a combination of multidimensional data structures and
relational database tables. This allows efficient handling of large volumes of detailed
data while also providing the flexibility and scalability of relational databases.
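The hybrid idea can be sketched in plain Python. This illustration assumes one common arrangement, in which pre-aggregated totals sit in a cube-like structure and detailed rows stay in a relational-style store; the names summary_cube, total_sales and drill_through and the sample rows are all hypothetical.

```python
from collections import defaultdict

# Detailed rows, standing in for the relational (ROLAP-style) store.
detail_rows = [
    {"year": 2024, "city": "Chicago", "amount": 10},
    {"year": 2024, "city": "Chicago", "amount": 5},
    {"year": 2024, "city": "Toronto", "amount": 7},
]

# Pre-aggregated totals, standing in for the multidimensional (MOLAP-style) store.
summary_cube = defaultdict(int)
for row in detail_rows:
    summary_cube[(row["year"], row["city"])] += row["amount"]


def total_sales(year, city):
    """Summary query: answered from the pre-computed cube."""
    return summary_cube.get((year, city), 0)


def drill_through(year, city):
    """Detail query: falls back to the relational rows."""
    return [r for r in detail_rows if r["year"] == year and r["city"] == city]
```

A dashboard total would call total_sales and return immediately, while a user who drills into that total would trigger drill_through against the detailed rows.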