Data Warehousing INTERVIEW QUESTION
INTERVIEW
PREPARATION SERIES
( ALL SERIES IN SINGLE DOCUMENT )
🔴 Que 1.) What is a data warehouse ❓
➡ A data warehouse is a centralized repository that stores large amounts of structured and unstructured data
from various sources.
➡ It is designed to support business intelligence activities such as data analysis, reporting,
and decision-making.
➡ A data warehouse typically contains historical data that has been extracted, transformed, and loaded from
various transactional systems.
➡ Its purpose is to provide a single source of truth for the organization, enabling users to make informed
decisions based on reliable data.
➡ Data warehouses are optimized for query and analysis performance and can support advanced analytics
techniques such as predictive analytics and machine learning.
➡ In Bill Inmon's classic definition, a data warehouse is a Subject-Oriented, Integrated, Time-Variant and
Non-Volatile collection of data in support of management's decision-making process.
🔴 Que 2.) What is a dimension table and a fact table in data warehouse ❓
➡ In data warehousing, a dimension table is a table that contains descriptive attributes that can be used to filter
and group data in a fact table. A fact table, on the other hand, is a table that contains quantitative measures or
facts that can be analyzed using the dimensions in a dimension table.
➡ Dimension Table:
👉 A dimension table is a table that contains descriptive attributes that provide context for the data in a fact
table. These attributes are used to filter and group data in the fact table.
👉 For example, in a sales data warehouse, a dimension table might contain information about customers,
products, sales channels, and time periods.
👉 Each record in the dimension table represents a unique combination of attributes that define a particular
entity, such as a customer or a product.
👉 Dimension tables are used to describe dimensions; i.e. they contain a primary key plus the detailed values and
attributes of each dimension member. Without dimensions, the measures in a fact table have no context and are
hard to interpret.
➡ Fact Table:
👉 A fact table is a table that contains quantitative measures or facts that can be analyzed using the dimensions
in a dimension table.
👉 These facts can be numeric values such as sales revenue, quantities sold, or profit margins. Each record in a
fact table represents a unique combination of dimensions and associated measures.
👉 Fact tables contain Foreign Keys referring to Dimension tables where descriptive information is kept as well
as measurable facts that Data Analysts would want to examine.
👉 Fact Tables are designed to a low level of uniform detail (referred to as "Granularity" or "Grain"), i.e. Facts can
record events at a very atomic level. This can result in the accumulation of a large number of records in a fact
table over a period of time.
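➡ Sketch (Python): the relationship between the two table types can be illustrated with a small in-memory SQLite database. The table and column names (dim_product, fact_sales, and so on) and the sample rows are hypothetical and only serve the illustration.

```python
import sqlite3

# Hypothetical sales warehouse: one dimension table and one fact table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (
        product_key  INTEGER PRIMARY KEY,  -- primary key of the dimension
        product_name TEXT,                 -- descriptive attribute
        category     TEXT                  -- descriptive attribute
    );
    CREATE TABLE fact_sales (
        product_key   INTEGER REFERENCES dim_product(product_key),  -- foreign key
        quantity_sold INTEGER,             -- measure
        revenue       REAL                 -- measure
    );
    INSERT INTO dim_product VALUES (1, 'Laptop', 'Electronics'),
                                   (2, 'Phone',  'Electronics'),
                                   (3, 'Desk',   'Furniture');
    INSERT INTO fact_sales VALUES (1, 2, 2000.0), (2, 5, 2500.0),
                                  (3, 1, 300.0),  (1, 1, 1000.0);
""")

# The dimension attribute (category) groups the measures stored in the fact table.
result = conn.execute("""
    SELECT d.category, SUM(f.quantity_sold), SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS d ON d.product_key = f.product_key
    GROUP BY d.category
    ORDER BY d.category
""").fetchall()

print(result)  # [('Electronics', 8, 5500.0), ('Furniture', 1, 300.0)]
```

The fact table holds only keys and numbers; everything a user would filter or group by lives in the dimension table.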
🔴 Que 3.) What are different types of dimension in data warehouse ❓
➡ In the world of data warehousing, dimensions refer to the attributes that describe a particular aspect of a
business. A dimension can be used to organize data in a data warehouse and is often associated with a hierarchy
of attributes. Here are some types of dimensions that are commonly used in data warehousing:
👉 Slowly Changing Dimension (SCD): As the name suggests, a slowly changing dimension is a type of
dimension that changes slowly over time. It is used to track changes in historical data. For example, a customer's
address or name may change over time, and an SCD is used to keep track of these changes while maintaining
historical records. There are different types of SCDs, including type 1, type 2, and type 3, each with its own
approach to handling changes in data.
👉 Fast Changing Dimension: In contrast to SCDs, fast-changing (or rapidly changing) dimensions contain
attributes that change frequently. Storing a new row for every change would make the dimension grow very
quickly, so these attributes are often overwritten with the current value or moved into a separate, smaller
dimension (a mini-dimension). Examples of fast-changing attributes include a customer's credit score band or
loyalty tier.
👉 Role Playing Dimension: A role-playing dimension is a single dimension that is referenced more than once by
the same fact table, each time with a different meaning. For example, one date dimension can be joined to an
orders fact table as the order date, the ship date, and the delivery date. Each usage of the dimension is called a
role, and the dimension is said to be playing multiple roles.
👉 Garbage/Junk Dimension: A garbage or junk dimension is a dimension that groups miscellaneous
low-cardinality flags and indicators that don't fit into any other dimension. For example, an order's payment
type, gift-wrap flag, and rush-delivery flag can be combined into a single junk dimension instead of creating a
separate small dimension for each one.
➡ Each of these dimensions plays a specific role in organizing and analyzing data in a data warehouse. By
understanding these different types of dimensions, businesses can design a data warehouse that meets their
specific needs and provides valuable insights into their operations.
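➡ Sketch (Python): a role-playing dimension can be pictured as one date dimension that the same fact table references twice, once as the order date and once as the ship date. The tables dim_date and fact_orders and their rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date (
        date_key      INTEGER PRIMARY KEY,
        calendar_date TEXT
    );
    CREATE TABLE fact_orders (
        order_id       INTEGER PRIMARY KEY,
        order_date_key INTEGER REFERENCES dim_date(date_key),  -- role 1: order date
        ship_date_key  INTEGER REFERENCES dim_date(date_key)   -- role 2: ship date
    );
    INSERT INTO dim_date VALUES (20240101, '2024-01-01'), (20240103, '2024-01-03');
    INSERT INTO fact_orders VALUES (1, 20240101, 20240103);
""")

# The same physical dimension is joined twice under two aliases, one per role.
roles = conn.execute("""
    SELECT o.order_id, od.calendar_date AS order_date, sd.calendar_date AS ship_date
    FROM fact_orders AS o
    JOIN dim_date AS od ON od.date_key = o.order_date_key
    JOIN dim_date AS sd ON sd.date_key = o.ship_date_key
""").fetchall()

print(roles)  # [(1, '2024-01-01', '2024-01-03')]
```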
🔴 Que 4.) Explain Slowly Changing Dimension 1 (SCD1) in data warehouse ❓
➡ Slowly Changing Dimensions (SCD) are an essential aspect of data warehousing, where data changes over
time, but still needs to be tracked and analyzed. Type 1 SCD is the simplest and easiest approach to handle
dimension changes. It involves overwriting the old data with the new data without maintaining a history of
changes.
➡ Type 1 SCD is suitable for dimensions that don't change frequently, and where historical data isn't important.
For example, if a product's name changes, we can overwrite the old name with the new name without creating a
new record. This approach is also suitable for dimensions where the data is volatile, and only the most recent
data is required.
➡ The advantages of using Type 1 SCD are its simplicity, low storage requirements, and faster query
performance. It requires fewer database resources and is easy to implement. Since there is no need to create new
records, it saves storage space, and query performance is faster as there are fewer records to search through.
➡ However, the disadvantage of using Type 1 SCD is that it overwrites the old data, which means that the
historical data is lost. This can be problematic if there is a need to track the changes in the dimension data over
time. For example, if the price of a product changes frequently, we cannot track the price changes using Type 1
SCD.
➡ In conclusion, Type 1 SCD is a simple and easy approach to handle dimension changes in data warehousing.
It's suitable for dimensions that don't change frequently, and where historical data isn't important. However, if
we need to track the changes in the dimension data over time, we should consider using Type 2 or Type 3 SCD.
It's essential to choose the appropriate SCD approach based on the specific needs of the project.
➡ Example:
In our example, we originally have the following table:
_______________________________
| Key  | Name      | State    |
-------------------------------
| 1001 | Christina | Illinois |
-------------------------------
After Christina moved from Illinois to California, the new information overwrites the original record, and we
have the following table:
_________________________________
| Key  | Name      | State      |
---------------------------------
| 1001 | Christina | California |
---------------------------------
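➡ Sketch (Python): a Type 1 change is simply an in-place overwrite. The dictionary layout and the helper name apply_scd1 are assumptions made for this illustration, not part of any library.

```python
def apply_scd1(dimension, key, changes):
    """Type 1: overwrite the attributes in place, keeping no history."""
    dimension[key].update(changes)
    return dimension


# Dimension rows keyed by the dimension key from the example above.
customers = {1001: {"name": "Christina", "state": "Illinois"}}

apply_scd1(customers, 1001, {"state": "California"})

print(customers)  # {1001: {'name': 'Christina', 'state': 'California'}}
```

After the update there is no trace of "Illinois" left, which is exactly the trade-off of Type 1.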
🔴 Que 5.) Explain Slowly Changing Dimension 2 (SCD2) in data warehouse ❓
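➡ Type 2 SCD is the most commonly used approach to handle dimension changes. It involves adding a new
record, with its own new unique identifier (surrogate key), for each change in dimension data, while the old
record is kept unchanged to preserve history. In practice, Type 2 tables usually also carry effective-date or
current-flag columns so that the active version can be identified.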
➡ Type 2 SCD is suitable for dimensions that change frequently and require a history of changes. For example, if
a customer changes their address, we create a new record with a new unique identifier to preserve the old
address. This approach allows us to track changes in the dimension data over time and provides a full history of
the data.
➡ The advantages of using Type 2 SCD are that it allows us to track changes in dimension data over time and
provides a complete history of the data. It also preserves the old data, which is essential for analysis and
reporting. Type 2 SCD is more flexible than Type 1 SCD, as it can handle dimensions with more changes over
time.
➡ However, the disadvantage of using Type 2 SCD is that it requires more storage space than Type 1 SCD, as we
need to create new records for every change in the dimension data. This approach can also slow down query
performance, as there are more records to search through.
➡ In conclusion, Type 2 SCD is the most commonly used approach to handle dimension changes in data
warehousing. It's suitable for dimensions that change frequently and require a history of changes. Type 2 SCD
allows us to track changes in the dimension data over time and provides a complete history of the data.
However, we should be aware that it requires more storage space and can slow down query performance. It's
essential to choose the appropriate SCD approach based on the specific needs of the project.
➡ Example:
In our example, we originally have the following table:
_______________________________
| Key  | Name      | State    |
-------------------------------
| 1001 | Christina | Illinois |
-------------------------------
After Christina moved from Illinois to California, a new record is added alongside the original one, and we have
the following table:
_________________________________
| Key  | Name      | State      |
---------------------------------
| 1001 | Christina | Illinois   |
---------------------------------
| 1005 | Christina | California |
---------------------------------
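➡ Sketch (Python): a Type 2 change closes the current version of a row and appends a new version with a new key. The customer_id business key, the valid_from / valid_to columns, and the dates are assumptions added for this illustration; the keys 1001 and 1005 come from the example above.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class CustomerRow:
    key: int                         # surrogate key, unique per version
    customer_id: str                 # business key (hypothetical), shared by all versions
    name: str
    state: str
    valid_from: date
    valid_to: Optional[date] = None  # None means "still current"


def apply_scd2(rows: List[CustomerRow], customer_id: str, new_state: str,
               change_date: date, new_key: int) -> None:
    """Type 2: close the current version and append a new one."""
    # Raises StopIteration if the customer has no current row (kept simple for the sketch).
    current = next(r for r in rows
                   if r.customer_id == customer_id and r.valid_to is None)
    if current.state == new_state:
        return  # nothing changed, keep the current version
    current.valid_to = change_date
    rows.append(CustomerRow(new_key, customer_id, current.name, new_state, change_date))


rows = [CustomerRow(1001, "C-1", "Christina", "Illinois", date(2000, 1, 1))]
apply_scd2(rows, "C-1", "California", date(2003, 1, 15), new_key=1005)

for row in rows:
    print(row.key, row.name, row.state, row.valid_from, row.valid_to)
```

Both versions remain queryable: the Illinois row is closed on the change date, and the California row is the one with no end date.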
🔴 Que 6.) Explain Slowly Changing Dimension 3 (SCD3) in data warehouse ❓
➡ Type 3 SCD is a hybrid approach that combines elements of both Type 1 and Type 2 SCD. It involves creating
separate columns for the current and previous versions of the data.
➡ Type 3 SCD is suitable for dimensions where only the current and previous versions of the data are important.
For example, if a product's price changes occasionally, we can create separate columns for the current and
previous prices. This approach lets us compare the current value with the previous one and provides a limited
history of the data.
➡ The advantages of using Type 3 SCD are that it requires less storage space than Type 2 SCD, as we only need
to store the current and previous versions of the data. It's also faster than Type 2 SCD, as there are fewer records
to search through. Type 3 SCD is more flexible than Type 1 SCD, as it can handle dimensions with some changes
over time.
➡ However, the disadvantage of using Type 3 SCD is that it's limited in the history of changes it can track. It only
provides some history of the data, as we only store the current and previous versions of the data. This approach
is also less flexible than Type 2 SCD, as it can't handle dimensions with a lot of changes over time.
➡ In conclusion, Type 3 SCD is a hybrid approach that combines elements of both Type 1 and Type 2 SCD. It's
suitable for dimensions where only the current and previous versions of the data are important. Type 3 SCD
requires less storage space than Type 2 SCD, and it's faster than Type 2 SCD. However, it's limited in the history
of changes it can track and less flexible than Type 2 SCD. It's essential to choose the appropriate SCD approach
based on the specific needs of the project.
➡ Example:
In our example, we originally have the following table:
_______________________________
| Key  | Name      | State    |
-------------------------------
| 1001 | Christina | Illinois |
-------------------------------
After Christina moved from Illinois to California, the original record is updated in place using additional
columns, and we have the following table:
______________________________________________________________________
| Key  | Name      | Original State | Current State | Effective Date |
----------------------------------------------------------------------
| 1001 | Christina | Illinois       | California    | 15-JAN-2003    |
----------------------------------------------------------------------
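➡ Sketch (Python): a Type 3 change keeps one row and shifts the old value into a second column. This sketch assumes the "previous value" variant (the column is named previous_state here); for a single change it gives the same result as the Original State column in the example. The helper name apply_scd3 and the second move to Texas are hypothetical.

```python
from datetime import date


def apply_scd3(row, new_state, effective_date):
    """Type 3: move the current value to the 'previous' column, then overwrite it."""
    updated = dict(row)
    updated["previous_state"] = row["current_state"]
    updated["current_state"] = new_state
    updated["effective_date"] = effective_date
    return updated


christina = {"key": 1001, "name": "Christina",
             "previous_state": None, "current_state": "Illinois",
             "effective_date": None}

after_move = apply_scd3(christina, "California", date(2003, 1, 15))
print(after_move["previous_state"], after_move["current_state"])  # Illinois California

# A second (hypothetical) move shows the limitation: Illinois is no longer stored.
second_move = apply_scd3(after_move, "Texas", date(2005, 6, 1))
print(second_move["previous_state"], second_move["current_state"])  # California Texas
```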
🔴 Que 7.) What are different types of facts in data warehouse ❓
➡ In data warehousing, facts are pieces of information that describe the business or organization's performance.
Facts are used to measure the performance of the organization or business, and they are generally expressed in
terms of numerical values. Data warehousing involves the organization and management of large amounts of
data, including facts.
➡ There are different types of facts in data warehousing. Let's explore some of them:
👉 Additive Facts:
Additive facts are those that can be added across different dimensions. For example, the total sales revenue can
be added across different time periods or different product lines. Additive facts are used for quantitative analysis.
👉 Semi-additive Facts:
Semi-additive facts are those that can be added across some dimensions but not all. For example, the balance of
a bank account can be added across different accounts or customers on the same date, but not across different
time periods: summing the daily balances of a month does not give a meaningful total. Across time, such facts
are usually averaged or taken as the closing value.
👉 Non-additive Facts:
Non-additive facts are those that cannot be added across any dimension. For example, the average price of a
product cannot be added across different time periods or product lines. Ratios and percentages are typical
non-additive facts; they are usually recomputed from their additive components rather than summed.
👉 Derived Facts:
Derived facts are those that are calculated from other facts. For example, gross profit can be calculated by
subtracting the cost of goods sold from the sales revenue, and the profit margin by dividing that profit by the
sales revenue. Derived facts can be stored in the fact table or computed at query time.
👉 Factless Facts:
Factless facts (factless fact tables) are those that contain no measures or quantitative data, only the keys of the
dimensions involved. They are used to track the occurrences of events, which are then analyzed by counting
rows. For example, a factless fact table can be used to track the number of times a customer visited a store but
did not make a purchase.
➡ In conclusion, data warehousing involves managing and organizing large amounts of data, including facts.
There are different types of facts in data warehousing, and each serves a unique purpose. By understanding the
different types of facts, data warehousing professionals can build effective data models and make informed
decisions.
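➡ Sketch (Python): additivity can be checked on a few hypothetical daily account snapshots. Deposits are treated as an additive fact, while the end-of-day balance is treated as semi-additive: it sums across accounts on one date, but not across dates. All account names and amounts are made up for the illustration.

```python
from collections import defaultdict

# Hypothetical daily snapshots: (date, account, deposits_that_day, end_of_day_balance)
snapshots = [
    ("2024-01-01", "A", 100.0, 100.0),
    ("2024-01-01", "B", 50.0, 50.0),
    ("2024-01-02", "A", 20.0, 120.0),
    ("2024-01-02", "B", 0.0, 50.0),
]

# Additive: deposits can be summed across every dimension (accounts and dates).
total_deposits = sum(row[2] for row in snapshots)

# Semi-additive: balances can be summed across accounts, but only within one date.
balance_by_date = defaultdict(float)
for day, _account, _deposit, balance in snapshots:
    balance_by_date[day] += balance

# Across time, take the closing value (latest date) instead of a sum.
# ISO-formatted date strings sort chronologically, so max() picks the latest day.
closing_balance = balance_by_date[max(balance_by_date)]

# Summing balances across dates double counts money and is not meaningful.
misleading_sum = sum(row[3] for row in snapshots)

print(total_deposits)         # 170.0
print(dict(balance_by_date))  # {'2024-01-01': 150.0, '2024-01-02': 170.0}
print(closing_balance)        # 170.0
print(misleading_sum)         # 320.0
```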
🔴 Que 8.) What is the difference between OLAP and OLTP in data warehouse ❓
➡ In a data warehouse environment, there are two types of systems: Online Analytical Processing (OLAP) and
Online Transaction Processing (OLTP). While both systems play an essential role in a data warehouse, they are
designed to serve different purposes. OLTP systems record day-to-day transactions with many short inserts,
updates, and deletes, while OLAP systems run analytical queries over large volumes of historical data.
🔴 Que 9.) What are different types of schema in data warehouse ❓
➡ A schema is a logical structure that defines the organization and relationship of data in a database. In
data warehousing, there are three main types of schema: Star Schema, Snowflake Schema, and Galaxy
Schema. Each of these schemas has its own advantages and disadvantages, and the choice of schema
depends on the specific requirements of the organization.
➡ Star Schema
👉 The Star Schema is the most commonly used schema in data warehousing.
👉 It is a simple, denormalized schema that consists of one central fact table and multiple dimension
tables.
👉 The fact table contains the metrics and measures of interest, while the dimension tables contain the
descriptive information about the data.
👉 The fact table is connected to the dimension tables through foreign keys, and each dimension table is
independent of the others.
✅ Advantages:
Simple and easy to understand
Provides fast performance for queries
Requires fewer joins than a normalized (snowflake) schema
❌ Disadvantages:
Redundant data can be stored in multiple dimensions
Limited scalability
➡ Snowflake Schema
👉 The Snowflake Schema is a normalized version of the Star Schema.
👉 It contains the same fact table and dimension tables as the Star Schema, but the dimension tables are
normalized into multiple tables.
👉 This reduces redundancy and improves data consistency.
✅ Advantages:
Reduces redundancy and improves data consistency
Better suited for large and complex data sets
Can accommodate changes in data structures more easily than Star Schema
❌ Disadvantages:
More complex to understand and maintain
Queries may take longer to execute due to the need to join multiple tables
➡ Galaxy Schema
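👉 The Galaxy Schema, also known as a fact constellation schema, contains multiple fact tables that share
common dimension tables.
👉 It is designed to handle complex and heterogeneous data sets, where several business processes are analyzed
against the same dimensions. Some dimensions may be highly normalized while others are not, so it can combine
elements of both Star Schema and Snowflake Schema.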
✅ Advantages:
Flexible and adaptable to changing data structures
Can handle complex and heterogeneous data sets
Provides fast performance for queries
❌ Disadvantages:
More complex to design and maintain than Star Schema
Can be difficult to understand due to the complexity of the schema
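➡ Sketch (Python): the trade-off between Star and Snowflake can be seen by answering the same question against both layouts. In the star layout the category name sits on the product dimension; in the snowflake layout it is normalized into its own table, which costs one extra join. All table names and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Star: category stored directly on the product dimension (denormalized)
    CREATE TABLE star_dim_product (
        product_key INTEGER PRIMARY KEY, product_name TEXT, category_name TEXT);

    -- Snowflake: category split out into its own table (normalized)
    CREATE TABLE snow_dim_category (
        category_key INTEGER PRIMARY KEY, category_name TEXT);
    CREATE TABLE snow_dim_product (
        product_key INTEGER PRIMARY KEY, product_name TEXT,
        category_key INTEGER REFERENCES snow_dim_category(category_key));

    -- Both layouts share the same fact table
    CREATE TABLE fact_sales (product_key INTEGER, revenue REAL);

    INSERT INTO star_dim_product VALUES
        (1, 'Laptop', 'Electronics'), (2, 'Phone', 'Electronics'), (3, 'Desk', 'Furniture');
    INSERT INTO snow_dim_category VALUES (10, 'Electronics'), (20, 'Furniture');
    INSERT INTO snow_dim_product VALUES (1, 'Laptop', 10), (2, 'Phone', 10), (3, 'Desk', 20);
    INSERT INTO fact_sales VALUES (1, 3000.0), (2, 2500.0), (3, 300.0);
""")

# Star schema: one join from the fact table to the dimension.
star_result = conn.execute("""
    SELECT p.category_name, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN star_dim_product AS p ON p.product_key = f.product_key
    GROUP BY p.category_name ORDER BY p.category_name
""").fetchall()

# Snowflake schema: an extra join to reach the normalized category table.
snow_result = conn.execute("""
    SELECT c.category_name, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN snow_dim_product AS p ON p.product_key = f.product_key
    JOIN snow_dim_category AS c ON c.category_key = p.category_key
    GROUP BY c.category_name ORDER BY c.category_name
""").fetchall()

print(star_result)  # [('Electronics', 5500.0), ('Furniture', 300.0)]
print(snow_result)  # [('Electronics', 5500.0), ('Furniture', 300.0)]
```

Both queries return the same answer; the star layout repeats 'Electronics' on every product row, while the snowflake layout stores it once.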
🔴 Que 10.) What is data modeling and what are the different levels of data modeling in data warehouse ❓
➡ Data modeling is the process of creating a conceptual representation of data and defining the relationships
between different data elements. It is an essential step in the data warehouse development process. Data
modeling helps to ensure that the data warehouse is designed to support the needs of the organization and its
users.
➡ There are different levels of data modeling in a data warehouse, commonly described as conceptual, logical,
and physical. Each level of modeling provides a different perspective on the data, and each level is important for
designing an effective data warehouse.
🔴 Que 11.) What is ETL in data warehouse ❓
➡ ETL, which stands for Extract, Transform, and Load, is a process that is commonly used in data warehousing to
integrate data from various sources into a single, consolidated destination. This process involves extracting data
from source systems, transforming the data to meet the requirements of the destination system, and loading the
transformed data into the destination system.
➡ The first step of the ETL process is the extraction of data from the source systems. This involves identifying the
relevant data sources and retrieving the data from them. This data can come from a variety of sources such as
databases, spreadsheets, flat files, or even external sources like APIs.
➡ The second step in the ETL process is the transformation of the data. This is where the extracted data is
cleaned, filtered, and formatted to meet the requirements of the destination system. This step is critical as the
source data may be in different formats and may have inconsistencies that need to be addressed before it can be
used in the destination system. Common data transformations include data conversion, data enrichment, and
data aggregation.
➡ Finally, the transformed data is loaded into the destination system, which could be a data warehouse, data
mart, or any other system designed to store and manage data. The destination system is often optimized for
reporting and analysis, which makes it easier for users to access and analyze the data.
➡ ETL is a critical component of any data integration or data warehousing project. It ensures that data from
different sources is integrated into a single, consolidated destination system, and that the data is transformed
and optimized for analysis and reporting. By using ETL, organizations can gain insights into their data, make
informed decisions, and improve business performance.
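➡ Sketch (Python): the three ETL steps can be outlined with the standard library only. The CSV content, the orders table, and the cleaning rules (trim and title-case names, skip rows with a missing amount) are assumptions chosen for the illustration, not a prescribed pipeline.

```python
import csv
import io
import sqlite3

# Hypothetical source extract; in practice this could be a file, a database, or an API.
RAW_CSV = """order_id,customer,amount,order_date
1, alice ,10.50,2024-01-05
2,BOB,,2024-01-06
3,Carol,7.25,2024-01-06
"""


def extract(text):
    """Extract: read the raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: clean, filter, and convert the rows for the destination."""
    cleaned = []
    for row in rows:
        if not row["amount"].strip():
            continue  # filter out incomplete records
        cleaned.append((int(row["order_id"]),
                        row["customer"].strip().title(),  # consistent name format
                        float(row["amount"]),             # text to number
                        row["order_date"]))
    return cleaned


def load(rows, connection):
    """Load: write the transformed rows into the destination table."""
    connection.execute("""CREATE TABLE IF NOT EXISTS orders (
        order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL, order_date TEXT)""")
    connection.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", rows)
    connection.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)

loaded = conn.execute(
    "SELECT order_id, customer, amount FROM orders ORDER BY order_id").fetchall()
print(loaded)  # [(1, 'Alice', 10.5), (3, 'Carol', 7.25)]
```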
🔴 Que 12.) Explain the differences between database and data warehouse ❓
➡ In today's world, data is a crucial asset for any organization. It is used to make informed decisions and gain
insights into various aspects of the business. Two terms that are often used interchangeably but are
fundamentally different are "database" and "data warehouse". Let's dive into the differences between the two.
➡ What is a Database?
A database is a collection of related data organized in a structured format, designed to be accessed and
managed easily. Examples of databases include MySQL, Oracle, and Microsoft SQL Server.
Databases are optimized for transactional processing, meaning they are designed to handle frequent and fast
transactions such as inserting, updating, and deleting data.
➡ A data warehouse is a large and complex repository of data that is used by organizations to support their
decision-making processes. There are different types of data warehouses that are used depending on the nature
of the data and the requirements of the organization.
➡ Data Mart:
A data mart is a subset of the enterprise data warehouse that is designed to support the needs of a specific
department or business unit within an organization. Data marts are optimized for specific business processes and
are often designed using a bottom-up approach. They are typically less complex than the enterprise data
warehouse and are easier to manage.
➡ Several approaches can be used to design a data warehouse; the two main ones are the top-down approach
and the bottom-up approach.
➡ Staging layer: This layer stores the raw data extracted from multiple source systems.
➡ Data integration layer: The integration layer moves the data from the staging layer into the warehouse
database, where it is organized into hierarchical groups called dimensions, along with facts and aggregates.
The dimensions and facts together form the schema.
➡ Access layer: End-users access the data through the access layer and perform the data analysis.
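➡ Sketch (Python): the three layers can be mimicked in one in-memory SQLite database. A staging table holds the raw rows, the integration step builds a dimension and a fact table from it, and a view plays the role of the access layer. All object names (stg_sales, dim_product, fact_sales, v_revenue_by_category) and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Staging layer: raw rows exactly as extracted from the sources
    CREATE TABLE stg_sales (product_name TEXT, category TEXT, revenue REAL);
    INSERT INTO stg_sales VALUES ('Laptop', 'Electronics', 1000.0),
                                 ('Laptop', 'Electronics', 2000.0),
                                 ('Desk',   'Furniture',   300.0);

    -- Integration layer: organize the staged data into dimensions and facts
    CREATE TABLE dim_product (
        product_key INTEGER PRIMARY KEY, product_name TEXT UNIQUE, category TEXT);
    INSERT INTO dim_product (product_name, category)
        SELECT DISTINCT product_name, category FROM stg_sales;

    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product(product_key), revenue REAL);
    INSERT INTO fact_sales
        SELECT d.product_key, s.revenue
        FROM stg_sales AS s
        JOIN dim_product AS d ON d.product_name = s.product_name;

    -- Access layer: a view that end users query for analysis
    CREATE VIEW v_revenue_by_category AS
        SELECT d.category, SUM(f.revenue) AS total_revenue
        FROM fact_sales AS f
        JOIN dim_product AS d ON d.product_key = f.product_key
        GROUP BY d.category;
""")

report = conn.execute(
    "SELECT category, total_revenue FROM v_revenue_by_category ORDER BY category"
).fetchall()
print(report)  # [('Electronics', 3000.0), ('Furniture', 300.0)]
```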