Data Warehousing INTERVIEW QUESTION

Uploaded by

mahima
Copyright
© All Rights Reserved
Data Warehousing

INTERVIEW
PREPARATION SERIES
( ALL SERIES IN SINGLE DOCUMENT )
🔴 Que 1.) What is a data warehouse ❓

➡ A data warehouse is a centralized repository that stores large amounts of structured and unstructured data
from various sources.
➡ It is designed to support business intelligence activities such as data analysis, reporting,
and decision-making.
➡ A data warehouse typically contains historical data that has been extracted, transformed, and loaded from
various transactional systems.
➡ Its purpose is to provide a single source of truth for the organization, enabling users to make informed
decisions based on reliable data.
➡ Data warehouses are optimized for query and analysis performance and can support advanced analytics
techniques such as predictive analytics and machine learning.
➡ A data warehouse is a Subject-Oriented, Integrated, Time-Variant and Non-Volatile collection of data in
support of management's decision making process.
🔴 Que 2.) What is a dimension table and a fact table in data warehouse ❓

➡ In data warehousing, a dimension table is a table that contains descriptive attributes that can be used to filter
and group data in a fact table. A fact table, on the other hand, is a table that contains quantitative measures or
facts that can be analyzed using the dimensions in a dimension table.

➡ Dimension Table:
👉 A dimension table is a table that contains descriptive attributes that provide context for the data in a fact
table. These attributes are used to filter and group data in the fact table.
👉 For example, in a sales data warehouse, a dimension table might contain information about customers,
products, sales channels, and time periods.
👉 Each record in the dimension table represents a unique combination of attributes that define a particular
entity, such as a customer or a product.
👉 Dimension tables are used to describe dimensions; i.e. they contain primary keys and the detailed values and
attributes related to those dimensions. Without its dimensions, a fact table is just a set of numbers with no context.

➡ Fact Table:
👉 A fact table is a table that contains quantitative measures or facts that can be analyzed using the dimensions
in a dimension table.
👉 These facts can be numeric values such as sales revenue, quantities sold, or profit margins. Each record in a
fact table represents a unique combination of dimensions and associated measures.
👉 Fact tables contain Foreign Keys referring to Dimension tables where descriptive information is kept as well
as measurable facts that Data Analysts would want to examine.
👉 Fact Tables are designed to a low level of uniform detail (referred to as "Granularity" or "Grain"), i.e. Facts can
record events at a very atomic level. This can result in the accumulation of a large number of records in a fact
table over a period of time.
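
➡ Example (sketch): The relationship between the two kinds of table can be illustrated with Python's built-in
sqlite3 module. The table names (dim_product, fact_sales), the columns, and the numbers are all hypothetical and
chosen only for illustration; a real warehouse would have many more dimensions and rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension table: descriptive attributes, one row per product.
    CREATE TABLE dim_product (
        product_key  INTEGER PRIMARY KEY,
        product_name TEXT,
        category     TEXT
    );
    -- Fact table: numeric measures plus a foreign key to the dimension.
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product(product_key),
        quantity    INTEGER,
        revenue     REAL
    );
    INSERT INTO dim_product VALUES
        (1, 'Pen', 'Stationery'), (2, 'Notebook', 'Stationery'), (3, 'Mug', 'Kitchen');
    INSERT INTO fact_sales VALUES
        (1, 10, 15.0), (2, 4, 20.0), (3, 2, 18.0), (1, 5, 7.5);
""")

# Group the measures in the fact table by an attribute taken from the dimension table.
revenue_by_category = dict(conn.execute("""
    SELECT d.category, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS d ON d.product_key = f.product_key
    GROUP BY d.category
""").fetchall())
print(revenue_by_category)
```

The fact table holds only keys and numbers; the category used for grouping lives in the dimension table and is
reached through the foreign key.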
🔴 Que 3.) What are different types of dimension in data warehouse ❓

➡ In the world of data warehousing, dimensions refer to the attributes that describe a particular aspect of a
business. A dimension can be used to organize data in a data warehouse and is often associated with a hierarchy
of attributes. Here are some types of dimensions that are commonly used in data warehousing:

👉 Slowly Changing Dimension (SCD): As the name suggests, a slowly changing dimension is a type of
dimension that changes slowly over time. It is used to track changes in historical data. For example, a customer's
address or name may change over time, and an SCD is used to keep track of these changes while maintaining
historical records. There are different types of SCDs, including type 1, type 2, and type 3, each with its own
approach to handling changes in data.
👉 Fast Changing Dimension: In contrast to SCDs, fast-changing dimensions are those that change frequently.
These dimensions are not tracked historically but instead are updated regularly to reflect the current state of the
business. Examples of fast-changing dimensions include inventory levels or stock prices.
👉 Role Playing Dimension: A role-playing dimension is a single dimension that is referenced more than once by
the same fact table, with each reference carrying a different meaning. For example, an orders fact table can link
to one date dimension three times, as order date, ship date, and delivery date. Each usage of the dimension is
called a role, and the dimension is said to be playing multiple roles.
👉 Garbage/Junk Dimension: A garbage or junk dimension groups miscellaneous low-cardinality flags and
indicators that don't fit into any other dimension. For example, an order may carry a payment-type code, a
gift-wrap flag, and a rush-order flag. Instead of keeping each flag as its own tiny dimension or as loose columns
in the fact table, their combinations can be stored in one junk dimension and referenced through a single key.

➡ Each of these dimensions plays a specific role in organizing and analyzing data in a data warehouse. By
understanding these different types of dimensions, businesses can design a data warehouse that meets their
specific needs and provides valuable insights into their operations.
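
➡ Example (sketch): A role-playing dimension is easiest to see in a query. In the sketch below, which uses
Python's built-in sqlite3 module, a hypothetical fact_orders table refers to one dim_date table twice, once as the
order date and once as the ship date. All table names, column names, and values are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, month_name TEXT);
    CREATE TABLE fact_orders (
        order_id       INTEGER PRIMARY KEY,
        order_date_key INTEGER REFERENCES dim_date(date_key),
        ship_date_key  INTEGER REFERENCES dim_date(date_key)
    );
    INSERT INTO dim_date VALUES (20240131, 'January'), (20240202, 'February');
    INSERT INTO fact_orders VALUES (1, 20240131, 20240202);
""")

# The same physical dimension is joined twice under different aliases (roles).
row = conn.execute("""
    SELECT o.order_id, od.month_name, sd.month_name
    FROM fact_orders AS o
    JOIN dim_date AS od ON od.date_key = o.order_date_key
    JOIN dim_date AS sd ON sd.date_key = o.ship_date_key
""").fetchone()
print(row)
```

Only one date table is stored; the aliases od and sd give it its two roles at query time.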
🔴 Que 4.) Explain Slowly Changing Dimension 1 (SCD1) in data warehouse ❓

➡ Slowly Changing Dimensions (SCD) are an essential aspect of data warehousing, where data changes over
time, but still needs to be tracked and analyzed. Type 1 SCD is the simplest and easiest approach to handle
dimension changes. It involves overwriting the old data with the new data without maintaining a history of
changes.

➡ Type 1 SCD is suitable for dimensions that don't change frequently, and where historical data isn't important.
For example, if a product's name changes, we can overwrite the old name with the new name without creating a
new record. This approach is also suitable for dimensions where the data is volatile, and only the most recent
data is required.

➡ The advantages of using Type 1 SCD are its simplicity, low storage requirements, and faster query
performance. It requires fewer database resources and is easy to implement. Since there is no need to create new
records, it saves storage space, and query performance is faster as there are fewer records to search through.

➡ However, the disadvantage of using Type 1 SCD is that it overwrites the old data, which means that the
historical data is lost. This can be problematic if there is a need to track the changes in the dimension data over
time. For example, if the price of a product changes frequently, we cannot track the price changes using Type 1
SCD.

➡ In conclusion, Type 1 SCD is a simple and easy approach to handle dimension changes in data warehousing.
It's suitable for dimensions that don't change frequently, and where historical data isn't important. However, if
we need to track the changes in the dimension data over time, we should consider using Type 2 or Type 3 SCD.
It's essential to choose the appropriate SCD approach based on the specific needs of the project.

➡ Example:
In our example, we originally have the following table:
+------+-----------+----------+
| Key  | Name      | State    |
+------+-----------+----------+
| 1001 | Christina | Illinois |
+------+-----------+----------+

After Christina moved from Illinois to California, the new information overwrites the original record, and we have
the following table:
+------+-----------+------------+
| Key  | Name      | State      |
+------+-----------+------------+
| 1001 | Christina | California |
+------+-----------+------------+
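
➡ Example (sketch): The overwrite can be expressed in a few lines of Python. The function name apply_scd1 and
the dictionary layout are hypothetical; the point is only that the row is updated in place and no old value survives.

```python
def apply_scd1(dimension, key, changes):
    """Type 1: overwrite the attributes in place; no history is kept."""
    dimension[key].update(changes)
    return dimension


# One dimension row, keyed by the dimension key from the example table.
customers = {1001: {"name": "Christina", "state": "Illinois"}}

apply_scd1(customers, 1001, {"state": "California"})
print(customers)
```

After the call there is still exactly one row for key 1001, and the previous state can no longer be recovered from
the dimension.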
🔴 Que 5.) Explain Slowly Changing Dimension 2 (SCD2) in data warehouse ❓

➡ Type 2 SCD is the most commonly used approach to handle dimension changes. It involves creating a new
record for each change in dimension data and giving that record its own unique identifier (a surrogate key), so
that the old record is preserved unchanged.
➡ Type 2 SCD is suitable for dimensions that change frequently and require a history of changes. For example, if
a customer changes their address, we create a new record with a new unique identifier to preserve the old
address. This approach allows us to track changes in the dimension data over time and provides a full history of
the data.
➡ The advantages of using Type 2 SCD are that it allows us to track changes in dimension data over time and
provides a complete history of the data. It also preserves the old data, which is essential for analysis and
reporting. Type 2 SCD is more flexible than Type 1 SCD, as it can handle dimensions with more changes over
time.
➡ However, the disadvantage of using Type 2 SCD is that it requires more storage space than Type 1 SCD, as we
need to create new records for every change in the dimension data. This approach can also slow down query
performance, as there are more records to search through.
➡ In conclusion, Type 2 SCD is the most commonly used approach to handle dimension changes in data
warehousing. It's suitable for dimensions that change frequently and require a history of changes. Type 2 SCD
allows us to track changes in the dimension data over time and provides a complete history of the data.
However, we should be aware that it requires more storage space and can slow down query performance. It's
essential to choose the appropriate SCD approach based on the specific needs of the project.

➡ Example:
In our example, we originally have the following table:
+------+-----------+----------+
| Key  | Name      | State    |
+------+-----------+----------+
| 1001 | Christina | Illinois |
+------+-----------+----------+

After Christina moved from Illinois to California, a new record is added for the new information while the
original record is kept, and we have the following table:
+------+-----------+------------+
| Key  | Name      | State      |
+------+-----------+------------+
| 1001 | Christina | Illinois   |
+------+-----------+------------+
| 1005 | Christina | California |
+------+-----------+------------+
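
➡ Example (sketch): In practice a Type 2 dimension usually carries a few housekeeping columns so that the
current version of a record can be found. The sketch below assumes a natural key (customer_id), effective dates,
and an is_current flag; none of these columns appear in the table above, and the dates and the "C-17" identifier
are placeholders chosen for illustration.

```python
from datetime import date


def apply_scd2(rows, natural_key, changes, change_date, next_key):
    """Type 2: expire the current row for natural_key and append a new version."""
    current = next(
        r for r in rows if r["customer_id"] == natural_key and r["is_current"]
    )
    # Close the old version instead of overwriting it.
    current["is_current"] = False
    current["end_date"] = change_date
    # Copy the old row, apply the changes, and give the new version its own key.
    new_row = {
        **current,
        **changes,
        "key": next_key,
        "start_date": change_date,
        "end_date": None,
        "is_current": True,
    }
    rows.append(new_row)
    return new_row


rows = [{
    "key": 1001, "customer_id": "C-17", "name": "Christina", "state": "Illinois",
    "start_date": date(2000, 1, 1), "end_date": None, "is_current": True,  # placeholder dates
}]

apply_scd2(rows, "C-17", {"state": "California"}, date(2003, 1, 15), next_key=1005)
for r in rows:
    print(r["key"], r["state"], r["is_current"])
```

Both versions remain in the table, so a fact recorded while Christina lived in Illinois can still point at key 1001.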
🔴 Que 6.) Explain Slowly Changing Dimension 3 (SCD3) in data warehouse ❓

➡ Type 3 SCD is a hybrid approach that combines elements of both Type 1 and Type 2 SCD. It involves creating
separate columns for the current and previous versions of the data.
➡ Type 3 SCD is suitable for dimensions where only the current and previous versions of the data are important.
For example, if a product's price changes occasionally, we can create separate columns for the current and
previous prices. This approach lets us compare the current value with the one before it, but it provides only a
limited history of the data.
➡ The advantages of using Type 3 SCD are that it requires less storage space than Type 2 SCD, as we only need
to store the current and previous versions of the data. It's also faster than Type 2 SCD, as there are fewer records
to search through. Type 3 SCD is more flexible than Type 1 SCD, as it can handle dimensions with some changes
over time.
➡ However, the disadvantage of using Type 3 SCD is that it's limited in the history of changes it can track. It only
provides some history of the data, as we only store the current and previous versions of the data. This approach
is also less flexible than Type 2 SCD, as it can't handle dimensions with a lot of changes over time.
➡ In conclusion, Type 3 SCD is a hybrid approach that combines elements of both Type 1 and Type 2 SCD. It's
suitable for dimensions where only the current and previous versions of the data are important. Type 3 SCD
requires less storage space than Type 2 SCD, and it's faster than Type 2 SCD. However, it's limited in the history
of changes it can track and less flexible than Type 2 SCD. It's essential to choose the appropriate SCD approach
based on the specific needs of the project.

➡ Example:
In our example, we originally have the following table:
+------+-----------+----------+
| Key  | Name      | State    |
+------+-----------+----------+
| 1001 | Christina | Illinois |
+------+-----------+----------+

After Christina moved from Illinois to California, the original state is kept in one column and the new state is
stored in another, and we have the following table:
+------+-----------+----------------+---------------+----------------+
| Key  | Name      | Original State | Current State | Effective Date |
+------+-----------+----------------+---------------+----------------+
| 1001 | Christina | Illinois       | California    | 15-JAN-2003    |
+------+-----------+----------------+---------------+----------------+
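
➡ Example (sketch): The same change can be written as a small Python function. The name apply_scd3 and the
dictionary keys are hypothetical; they mirror the columns of the table above.

```python
def apply_scd3(row, new_state, effective_date):
    """Type 3: keep the original value in one column, the current value in another."""
    row["current_state"] = new_state
    row["effective_date"] = effective_date
    return row


christina = {
    "key": 1001,
    "name": "Christina",
    "original_state": "Illinois",
    "current_state": "Illinois",
    "effective_date": None,
}

apply_scd3(christina, "California", "15-JAN-2003")
print(christina)
```

A second move would overwrite current_state again, so any state between the original and the latest one is lost.
That is the limited history this approach trades for its single-row simplicity.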
🔴 Que 7.) What are different types of facts in data warehouse ❓

➡ In data warehousing, facts are pieces of information that describe the business or organization's performance.
Facts are used to measure the performance of the organization or business, and they are generally expressed in
terms of numerical values. Data warehousing involves the organization and management of large amounts of
data, including facts.

➡ There are different types of facts in data warehousing. Let's explore some of them:

👉 Additive Facts:
Additive facts are those that can be added across different dimensions. For example, the total sales revenue can
be added across different time periods or different product lines. Additive facts are used for quantitative analysis.
👉 Semi-additive Facts:
Semi-additive facts are those that can be added across some dimensions but not all. For example, the balance of
a bank account can be added across different accounts or branches but not across time periods: summing
Monday's and Tuesday's balances counts the same money twice. Across time, such facts are usually averaged or
taken at period end.
👉 Non-additive Facts:
Non-additive facts are those that cannot be added across any dimension. For example, the average price of a
product, a ratio, or a percentage cannot be added across different time periods or product lines. They are usually
recalculated from their additive components.
👉 Derived Facts:
Derived facts are those that are calculated from other facts. For example, gross profit can be calculated by
subtracting the cost of goods sold from the sales revenue, and the profit margin is that profit divided by the sales
revenue.
👉 Factless Facts:
Factless facts are those that contain no measures or quantitative data. Factless facts are used to track the
occurrences of events. For example, a factless fact can be used to track the number of times a customer visited a
store but did not make a purchase.

➡ In conclusion, data warehousing involves managing and organizing large amounts of data, including facts.
There are different types of facts in data warehousing, and each serves a unique purpose. By understanding the
different types of facts, data warehousing professionals can build effective data models and make informed
decisions.
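
➡ Example (sketch): The difference between additive and non-additive facts shows up as soon as the numbers are
rolled up. The regions and figures below are invented for illustration.

```python
# Each row: (region, units_sold, revenue). Illustrative numbers only.
sales = [("North", 10, 100.0), ("South", 30, 600.0)]

# Additive facts: summing across regions gives a meaningful total.
total_units = sum(units for _region, units, _revenue in sales)
total_revenue = sum(revenue for _region, _units, revenue in sales)

# Non-additive fact: per-region average prices cannot simply be combined.
avg_prices = [revenue / units for _region, units, revenue in sales]
naive_avg = sum(avg_prices) / len(avg_prices)  # ignores how many units each region sold

# Derive the overall average price from the additive totals instead.
overall_avg_price = total_revenue / total_units

print(total_units, total_revenue, naive_avg, overall_avg_price)
```

The naive figure treats both regions as equally weighted, while the figure derived from the totals reflects that most
units were sold at the higher price. This is why ratios and averages are usually stored as their additive components.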
🔴 Que 8.) What is the difference between OLAP & OLTP in data warehouse ❓

➡ In a data warehouse environment, there are two types of systems: Online Analytical Processing (OLAP) and
Online Transaction Processing (OLTP). While both systems play an essential role in a data warehouse, they are
designed to serve different purposes. In this post, we'll explore the difference between OLAP and OLTP in a data
warehouse.

➡ OLTP (Online Transaction Processing)


👉 OLTP is a system designed to manage and process real-time transactional data in a business. It is used to
support day-to-day operations, such as order processing, inventory management, and online banking
transactions. OLTP is characterized by a high volume of transactions, short response times, and a low level of
complexity.
👉 Data in an OLTP system is highly normalized to avoid redundancy and ensure data consistency. The OLTP
system typically uses a relational database management system (RDBMS) to store and manage data.

➡ OLAP (Online Analytical Processing)


👉 OLAP is a system designed to perform complex analytical queries on large sets of historical data. The
purpose of OLAP is to provide insight into business performance over time, enabling organizations to make
informed decisions based on data-driven insights. OLAP is characterized by a low volume of transactions, longer
response times, and a high level of complexity.
👉 Data in an OLAP system is typically denormalized to improve query performance, and the system uses
multidimensional databases to store and manage data.

➡ Key Differences between OLAP and OLTP


👉 Purpose: OLTP is designed for real-time transaction processing, while OLAP is designed for complex
analytical queries on large sets of historical data.
👉 Data Volume: OLTP handles a high volume of transactions, while OLAP handles a low volume of transactions
but deals with large amounts of data.
👉 Data Structure: OLTP uses a highly normalized data structure to ensure data consistency, while OLAP uses a
denormalized data structure to improve query performance.
👉 Query Response Time: OLTP provides a short response time, while OLAP provides a longer response time due
to the complexity of queries.
👉 Users: OLTP is designed for transactional users, such as clerks, while OLAP is designed for analytical users,
such as business analysts and data scientists.
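
➡ Example (sketch): The two workloads differ mainly in the shape of their queries. The sketch below runs both
shapes against one small in-memory SQLite table purely to contrast them; real OLTP and OLAP systems are normally
separate, and the orders table and its values are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL, status TEXT)"
)
db.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1, "East", 40.0, "NEW"), (2, "West", 25.0, "NEW"), (3, "East", 10.0, "NEW")],
)

# OLTP-style work: touch one row by its key inside a short transaction.
with db:
    db.execute("UPDATE orders SET status = 'SHIPPED' WHERE order_id = ?", (2,))

# OLAP-style work: scan many rows and aggregate them.
revenue_by_region = dict(
    db.execute("SELECT region, SUM(amount) FROM orders GROUP BY region").fetchall()
)

status_of_2 = db.execute("SELECT status FROM orders WHERE order_id = 2").fetchone()[0]
print(status_of_2, revenue_by_region)
```

The update reads and writes a single row, while the aggregate has to read every row in the table, which is why the
two workloads are tuned so differently.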
🔴 Que 9.) What are different types of schemas in data warehouse ❓

➡ A schema is a logical structure that defines the organization and relationship of data in a database. In
data warehousing, there are three main types of schema: Star Schema, Snowflake Schema, and Galaxy
Schema. Each of these schemas has its own advantages and disadvantages, and the choice of schema
depends on the specific requirements of the organization.

➡ Star Schema
👉 The Star Schema is the most commonly used schema in data warehousing.
👉 It is a simple, denormalized schema that consists of one central fact table and multiple dimension
tables.
👉 The fact table contains the metrics and measures of interest, while the dimension tables contain the
descriptive information about the data.
👉 The fact table is connected to the dimension tables through foreign keys, and each dimension table is
independent of the others.

✅ Advantages:
Simple and easy to understand
Provides fast performance for queries
Requires fewer joins than normalized schemas, at the cost of some extra storage

❌ Disadvantages:
Redundant data can be stored in multiple dimensions
Limited scalability

➡ Snowflake Schema
👉 The Snowflake Schema is a normalized version of the Star Schema.
👉 It contains the same fact table and dimension tables as the Star Schema, but the dimension tables are
normalized into multiple tables.
👉 This reduces redundancy and improves data consistency.

✅ Advantages:
Reduces redundancy and improves data consistency
Better suited for large and complex data sets
Can accommodate changes in data structures more easily than Star Schema

❌ Disadvantages:
More complex to understand and maintain
Queries may take longer to execute due to the need to join multiple tables
➡ Galaxy Schema
👉 The Galaxy Schema, also called a Fact Constellation Schema, contains multiple fact tables that share
common dimension tables.
👉 It is designed to handle complex data sets that cover several business processes, and its dimension
tables may be denormalized as in a Star Schema or normalized as in a Snowflake Schema.

✅ Advantages:
Flexible and adaptable to changing data structures
Can handle complex and heterogeneous data sets
Provides fast performance for queries

❌ Disadvantages:
More complex to design and maintain than Star Schema
Can be difficult to understand due to the complexity of the schema
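
➡ Example (sketch): The trade-off between Star and Snowflake schemas can be seen by modeling the same
product dimension both ways. The sketch uses Python's built-in sqlite3 module; every table name, column name,
and value is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_sales (product_key INTEGER, revenue REAL);
    INSERT INTO fact_sales VALUES (1, 15.0), (2, 20.0), (3, 18.0);

    -- Star: one denormalized dimension; the category name is repeated per product.
    CREATE TABLE star_dim_product (
        product_key INTEGER PRIMARY KEY, product_name TEXT, category_name TEXT
    );
    INSERT INTO star_dim_product VALUES
        (1, 'Pen', 'Stationery'), (2, 'Notebook', 'Stationery'), (3, 'Mug', 'Kitchen');

    -- Snowflake: the category is normalized into its own table.
    CREATE TABLE snow_dim_category (category_key INTEGER PRIMARY KEY, category_name TEXT);
    CREATE TABLE snow_dim_product (
        product_key  INTEGER PRIMARY KEY,
        product_name TEXT,
        category_key INTEGER REFERENCES snow_dim_category(category_key)
    );
    INSERT INTO snow_dim_category VALUES (10, 'Stationery'), (20, 'Kitchen');
    INSERT INTO snow_dim_product VALUES (1, 'Pen', 10), (2, 'Notebook', 10), (3, 'Mug', 20);
""")

# Star: a single join reaches the category name.
star = dict(conn.execute("""
    SELECT p.category_name, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN star_dim_product AS p ON p.product_key = f.product_key
    GROUP BY p.category_name
""").fetchall())

# Snowflake: the same question needs one more join.
snowflake = dict(conn.execute("""
    SELECT c.category_name, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN snow_dim_product AS p ON p.product_key = f.product_key
    JOIN snow_dim_category AS c ON c.category_key = p.category_key
    GROUP BY c.category_name
""").fetchall())

print(star, snowflake)
```

Both queries return the same totals. The star version repeats 'Stationery' on every product row but needs one join;
the snowflake version stores each category name once but needs two.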
🔴 Que 10.) What is data modeling, and what are the different levels of data modeling in a data warehouse ❓

➡ Data modeling is the process of creating a conceptual representation of data and defining the relationships
between different data elements. It is an essential step in the data warehouse development process. Data
modeling helps to ensure that the data warehouse is designed to support the needs of the organization and its
users.
➡ There are different levels of data modeling in a data warehouse. Each level of modeling provides a different
perspective on the data, and each level is important for designing an effective data warehouse.

➡ Conceptual Data Modeling


👉 Conceptual data modeling is the highest level of modeling and focuses on the overall view of the data.
👉 It defines the business entities, their attributes, and the relationships between them.
👉 It is a non-technical representation of the data, which helps to ensure that the data warehouse is designed to
meet the business needs.
👉 Conceptual data modeling is typically done by business analysts and subject matter experts.
👉 At this level, the Data Modeler attempts to identify the important Entities and the Relationship among them.

➡ Logical Data Modeling


👉 Logical data modeling is the next level of modeling and is more technical than conceptual modeling.
👉 It defines the data elements and the relationships between them. Logical data modeling is focused on the
data structures and how they can be used to support business requirements.
👉 Logical data modeling is typically done by data architects and data modelers.
👉 At this level, the Data Modeler attempts to describe the data in as much detail as possible, without regard to
how it will be physically implemented in the database.

➡ Physical Data Modeling


👉 Physical data modeling is the most detailed level of modeling and is focused on the implementation of the
data warehouse.
👉 It defines the physical storage structures, such as tables, columns, and indexes.
👉 Physical data modeling is important for optimizing performance and ensuring that the data warehouse can
scale to meet the needs of the organization.
👉 At this level, the Data Modeler will specify how the Logical Data Model will be realized in the database
schema. A physical database model shows all Table Structures, including Column Name, Column Data Type,
Column Constraints, Primary Key, Foreign Key, and Relationships between Tables.
🔴 Que 11.) What is ETL ❓

➡ ETL, which stands for Extract, Transform, and Load, is a process that is commonly used in data warehousing to
integrate data from various sources into a single, consolidated destination. This process involves extracting data
from source systems, transforming the data to meet the requirements of the destination system, and loading the
transformed data into the destination system.

➡ The first step of the ETL process is the extraction of data from the source systems. This involves identifying the
relevant data sources and retrieving the data from them. This data can come from a variety of sources such as
databases, spreadsheets, flat files, or even external sources like APIs.

➡ The second step in the ETL process is the transformation of the data. This is where the extracted data is
cleaned, filtered, and formatted to meet the requirements of the destination system. This step is critical as the
source data may be in different formats and may have inconsistencies that need to be addressed before it can be
used in the destination system. Common data transformations include data conversion, data enrichment, and
data aggregation.

➡ Finally, the transformed data is loaded into the destination system, which could be a data warehouse, data
mart, or any other system designed to store and manage data. The destination system is often optimized for
reporting and analysis, which makes it easier for users to access and analyze the data.

➡ ETL is a critical component of any data integration or data warehousing project. It ensures that data from
different sources is integrated into a single, consolidated destination system, and that the data is transformed
and optimized for analysis and reporting. By using ETL, organizations can gain insights into their data, make
informed decisions, and improve business performance.
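
➡ Example (sketch): The three steps can be shown end to end with Python's standard library. An in-memory CSV
string stands in for a real source file or API, and an in-memory SQLite database stands in for the warehouse; the
column names, the cleaning rules, and the data are assumptions made for illustration.

```python
import csv
import io
import sqlite3

# Extract: read rows from the source (an in-memory CSV stands in for a real file).
source = io.StringIO("order_id,amount,country\n1, 10.50 ,us\n2,,de\n3,7.25,US\n")
raw_rows = list(csv.DictReader(source))

# Transform: drop rows with no amount, convert types, standardize country codes.
clean_rows = [
    (int(r["order_id"]), float(r["amount"]), r["country"].strip().upper())
    for r in raw_rows
    if r["amount"].strip()
]

# Load: write the transformed rows into the destination table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE orders (order_id INTEGER, amount REAL, country TEXT)")
warehouse.executemany("INSERT INTO orders VALUES (?, ?, ?)", clean_rows)
warehouse.commit()

# The loaded data is now ready for reporting queries.
total_by_country = dict(
    warehouse.execute("SELECT country, SUM(amount) FROM orders GROUP BY country").fetchall()
)
print(clean_rows, total_by_country)
```

The row with the missing amount is filtered out during the transform step, and the two spellings of the country
code are merged before loading, so the report sees one consistent value.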
🔴 Que 12.) Explain the differences between database and data warehouse ❓

➡ In today's world, data is a crucial asset for any organization. It is used to make informed decisions and gain
insights into various aspects of the business. Two terms that are often used interchangeably but are
fundamentally different are "database" and "data warehouse". Let's dive into the differences between the two.

➡ What is a Database?
A database is a collection of related data organized in a structured format, designed to be accessed and
managed easily. Examples of databases include MySQL, Oracle, and Microsoft SQL Server.
Databases are optimized for transactional processing, meaning they are designed to handle frequent and fast
transactions such as inserting, updating, and deleting data.

➡ What is a Data Warehouse?


A data warehouse is a large, centralized repository of data that is used to support decision-making activities. It is
a system that is designed to store and manage large volumes of data from different sources in a way that
enables efficient querying and analysis. Data warehouses are typically used for reporting, analysis, and data
mining purposes. Unlike databases, data warehouses are optimized for analytical processing.

➡ Key Differences between Database and Data Warehouse


👉 Purpose: The purpose of a database is to support the operations of a single application or business process.
The purpose of a data warehouse is to support decision-making activities by providing a central repository of
data for analysis and reporting.
👉 Design: Databases are designed for transactional processing, meaning they are optimized for fast and
frequent transactions. Data warehouses are designed for analytical processing, meaning they are optimized for
complex queries and data analysis.
👉 Data Structure: Databases often store normalized data, which means that redundant data is removed to
minimize storage needs. Data warehouses often store denormalized data, which means that redundant data is
intentionally included to simplify queries and improve query performance.
👉 Data Sources: Databases typically store data from a single application or business process. Data warehouses
store data from multiple sources, including databases, flat files, and other data sources.
👉 Data Volume: Databases are designed to handle moderate volumes of data. Data warehouses are designed
to handle large volumes of data.
🔴 Que 13.) Explain the types of a data warehouse❓

➡ A data warehouse is a large and complex repository of data that is used by organizations to support their
decision-making processes. There are different types of data warehouses that are used depending on the nature
of the data and the requirements of the organization.

➡ Enterprise Data Warehouse (EDW):


The enterprise data warehouse is a centralized repository that stores data from various sources within an
organization. The EDW is designed to support the needs of the entire organization, and it serves as the single
source of truth for all enterprise data. The EDW is typically designed using a top-down approach and is
optimized for complex analytical queries.

➡ Operational Data Store (ODS):


An operational data store is a database that stores real-time operational data from various sources within an
organization. The ODS is designed to support operational processes and is optimized for transactional
processing. The data in an ODS is often cleaned and transformed before it is loaded into the enterprise data
warehouse.

➡ Data Mart:
A data mart is a subset of the enterprise data warehouse that is designed to support the needs of a specific
department or business unit within an organization. Data marts are optimized for specific business processes and
are often designed using a bottom-up approach. They are typically less complex than the enterprise data
warehouse and are easier to manage.

➡ Virtual Data Warehouse:


A virtual data warehouse is a logical view of data that is created by combining data from various sources within
an organization. The virtual data warehouse is designed to provide a unified view of data to users without
physically moving the data into a single repository. This type of data warehouse is often used when it is not
feasible to physically move data into a central repository.

➡ Cloud Data Warehouse:


A cloud data warehouse is a type of data warehouse that is hosted on a cloud platform such as Amazon Web
Services (AWS), Google Cloud Platform (GCP), or Microsoft Azure. Cloud data warehouses are scalable and can
be quickly provisioned, making them a popular choice for organizations that need to process large volumes of
data.
🔴 Que 14.) What approaches are used to design the data warehouse❓

➡ To design a data warehouse, several approaches are used, and in this post, we will discuss mainly two
approaches used for the data warehouse design:

➡ Inmon approach (Top-Down Approach)


👉 The top-down approach is a traditional approach to designing a data warehouse.
👉 In this approach, the data warehouse is created first, and the data marts are then built from it.
👉 The data warehouse acts as the center of the Corporate Information Factory, which provides the logical
framework for the environment.
👉 It involves starting with the overall business requirements and then breaking them down into individual
functional areas.
👉 Each functional area is then analyzed to determine the specific data needs, and a data model is created to
represent the data.

➡ Kimball approach (Bottom-Up Approach)


👉 The bottom-up approach, on the other hand, involves starting with the existing data sources and building a
data warehouse incrementally.
👉 In this approach, the data marts are created first and are then integrated to form the complete data
warehouse.
👉 The integration of different data marts is called the data warehouse bus architecture.
👉 This approach is ideal for organizations that have multiple disparate data sources, and it involves integrating
these sources into a single data warehouse.
🔴 Que 15.) Explain the 3 layer architecture of the ETL cycle❓

➡ The ETL cycle consists of below 3 layers:

➡ Staging layer: This layer stores the data extracted from the different source systems.

➡ Data integration layer: The integration layer transforms the data from the staging layer and moves it into the
database, where it is organized into hierarchical groups (often called dimensions), facts, and aggregates. The
facts and dimensions together form the schema.

➡ Access layer: End-users access the data through the access layer and perform the data analysis.
