
Big Data Analytics BIT 407

UNIT 2
Data Warehouse

• Data Warehouse:
✓A data warehouse is a collection of data from various heterogeneous sources.
✓It is the central component of a business intelligence system, where data is managed and analyzed to support decision-making.
✓It relies on the extraction, transformation, and loading of data to make it available for analysis.
✓Data warehouses are also used to run queries on large amounts of data.
✓It uses data from various relational databases and application log
files.
Big Data vs Data Warehouse
How does data warehousing relate to big data?

• The two are quite different: as a technology, big data is a means of storing and managing very large volumes of data.
• On the other hand, a data warehouse is a set of software and
techniques that facilitate data collection and integration into a
centralized database.
Key Characteristics of Data Warehouse

The main characteristics of a data warehouse are as follows:


• Subject-Oriented
➢A data warehouse is subject-oriented: it organizes information around specific subjects rather than around the overall processes of a business.
➢Such subjects may be sales, promotion, inventory, etc.
➢For example, if you want to analyze your company’s sales data, you need
to build a data warehouse that concentrates on sales. Such a warehouse
would provide valuable information like ‘Who was your best customer last
year?’ or ‘Who is likely to be your best customer in the coming year?’
Key Characteristics of Data Warehouse

• Integrated
➢A data warehouse is developed by integrating data from varied
sources into a consistent format.
➢The data must be stored in the warehouse in a consistent and
universally acceptable manner in terms of naming, format, and
coding.
➢This facilitates effective data analysis.
Key Characteristics of Data Warehouse

• Non-Volatile
➢Data once entered into a data warehouse must remain
unchanged.
➢All data is read-only. Previous data is not erased when current
data is entered.
➢This helps you to analyze what has happened and when.
Key Characteristics of Data Warehouse

• Time-Variant
➢The data stored in a data warehouse is documented with an element of
time, either explicitly or implicitly.
➢Time variance is reflected in the primary key, which must contain an element of time such as the day, week, or month.
Data Warehouse Architecture

• Usually, data warehouse architecture comprises a three-tier structure.


Bottom Tier
The bottom tier (the data warehouse server):
▪ is usually a relational database system;
▪ is fed by back-end tools that cleanse, transform, and load data into this layer.
Data Warehouse Architecture

Middle Tier
The middle tier represents an OLAP server that can be implemented in
two ways.
▪ The ROLAP (Relational OLAP) model is an extended relational database management system that maps operations on multidimensional data onto standard relational operations (a sketch of this mapping follows below).
▪ The MOLAP (multidimensional OLAP) model operates directly on multidimensional data and operations.
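As a rough illustration of the ROLAP idea, the following Python sketch (using the standard sqlite3 module as a stand-in relational engine) answers a multidimensional question, units sold by product and month, with an ordinary relational GROUP BY. The table and column names are illustrative only, not part of the course material.

# Minimal ROLAP-style sketch: a multidimensional roll-up expressed as a
# standard relational aggregation. Names are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, month TEXT, units INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("Laptop", "2024-01", 10), ("Laptop", "2024-02", 7), ("Phone", "2024-01", 25)],
)

# The "multidimensional" question is mapped onto a relational GROUP BY.
for row in conn.execute(
    "SELECT product, month, SUM(units) FROM sales GROUP BY product, month"
):
    print(row)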
Data Warehouse Architecture
Top Tier
▪ This is the front-end client interface that gets data out from the data
warehouse.
▪ It holds various tools like query tools, analysis tools, reporting tools,
and data mining tools.
How Data Warehouse Works
• Data Warehousing integrates data and information collected from various
sources into one comprehensive database.
• For example, a data warehouse might combine customer information from
an organization’s point-of-sale systems, its mailing lists, website, and
comment cards.
• It might also incorporate confidential information about employees, salary
information, etc.
• Businesses use this consolidated data to analyze their customers.
• Data mining is one of the features of a data warehouse that involves
looking for meaningful data patterns in vast volumes of data and devising
innovative strategies for increased sales and profits.
Types of Data Warehouse
There are three main types of data warehouses.
Enterprise Data Warehouse (EDW)
• This type of warehouse serves as a key or central database that facilitates
decision-support services throughout the enterprise.
• The advantage of this type of warehouse is that it provides access to
cross-organizational information, offers a unified approach to data
representation, and allows running complex queries.
Types of Data Warehouse
Operational Data Store (ODS)
• This type of data warehouse refreshes in real time.
• It is often preferred for routine activities like storing employee records.
• It is required when data warehouse systems do not support reporting needs of
the business.

Data Mart
• A data mart is a subset of a data warehouse built to serve a particular
department, region, or business unit.
• Each department of a business can have its own data mart as a repository for its
data.
• The data from the data mart is stored in the ODS (operational data store)
periodically. The ODS then sends the data to the EDW, where it is stored and
used.
Data Warehousing Tools
These tools help to collect, read, write, and transfer data from
various sources.
They are designed to support operations like data sorting,
filtering, merging, etc.
• Query and reporting tools
• Application Development tools
• Data mining tools
• OLAP tools
Benefits of Data Warehouse
There are several benefits of a data warehouse:

• Improved data consistency


• Better business decisions
• Easier access to enterprise data for end-users
• Better documentation of data
• Reduced computer costs and higher productivity
• End-users can run ad-hoc queries and reports without degrading the
performance of operational systems
• Related data from various sources is collected in one place
Dimensional Modelling in Data Warehouse
• Dimensional Data Modeling is one of the data modeling techniques used
in data warehouse design.
• A dimensional model consists of fact and dimension tables.
• Since the main goal of this modeling is fast data retrieval, it is optimized for SELECT operations.
• The advantage of this model is that data is stored in a way that makes it easy to retrieve once it is in the data warehouse.
• The dimensional model is the data model used by many OLAP systems.
Elements of Dimensional Data Model
• Facts
Facts are the measurable data elements that represent the business metrics of
interest. For example, in a sales data warehouse, the facts might include sales
revenue, units sold, and profit margins. Each fact is associated with one or more
dimensions, creating a relationship between the fact and the descriptive data.
• Dimension
Dimensions are the descriptive data elements that are used to categorize or
classify the data. For example, in a sales data warehouse, the dimensions might
include product, customer, time, and location. Each dimension is made up of a set
of attributes that describe the dimension.
• Attributes
The characteristics of a dimension are known as attributes. They are used to filter and search facts. For a location dimension, attributes can be State, Country, Zipcode, etc.
Elements of Dimensional Data Model
• Fact Table
✓ A fact table is the primary table in a dimensional model.
✓ A fact table contains
1. Measurements/facts
2. Foreign keys to the dimension tables
• Dimension Table
• The dimensions of a fact are described by dimension tables, which are joined to the fact table by foreign keys.
• Dimension tables are simply de-normalized tables, and a dimension can have one or more relationships. A minimal schema sketch follows below.
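The sketch below illustrates these two elements with Python's built-in sqlite3 module: a fact table holding measures plus foreign keys, and two de-normalized dimension tables. All table and column names are illustrative assumptions, not part of the course material.

# Minimal sketch of a fact table and its dimension tables.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT,
    category     TEXT          -- descriptive attributes used to filter/search facts
);
CREATE TABLE dim_date (
    date_key  INTEGER PRIMARY KEY,
    full_date TEXT,
    month     TEXT,
    year      INTEGER
);
CREATE TABLE fact_sales (
    product_key   INTEGER REFERENCES dim_product(product_key),  -- foreign keys
    date_key      INTEGER REFERENCES dim_date(date_key),
    units_sold    INTEGER,     -- measurements/facts
    sales_revenue REAL
);
""")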
Types of Dimensions in a Data Warehouse Model
• Conformed Dimension
• Outrigger Dimension
• Shrunken Dimension
• Role-Playing Dimension
• Dimension to Dimension Table
• Junk Dimension
• Degenerate Dimension
• Swappable Dimension
• Step Dimension
Steps to Create Dimensional Data Modelling
Step-1: Identifying the business objective:
• The first step is to identify the business objective. Sales, HR,
Marketing, etc. are some examples of the needs of the
organization.
• This is the most important step of data modelling; the selection of the
business objective also depends on the quality of the data available for
that process.

Step-2: Identifying Granularity:


• Granularity is the lowest level of information stored in the table.
• The grain describes the level of detail at which the business problem and
its solution are represented.
Steps to Create Dimensional Data Modelling
Step-3: Identifying Dimensions and their Attributes:
• Dimensions are objects or things. Dimensions categorize and describe data
warehouse facts and measures in a way that supports meaningful answers to
business questions.
• A data warehouse organizes descriptive attributes as columns in dimension
tables. For example, the date dimension may contain attributes like year, month,
and weekday.
Step-4: Identifying the Fact:
The measurable data is held by the fact table. Most of the fact table rows are
numerical values like price or cost per unit, etc.
Steps to Create Dimensional Data Modelling
Step-5: Building of Schema:
• We implement the Dimension Model in this step.
• A schema is a database structure. There are two popular schemas: the Star
Schema and the Snowflake Schema.
1.Star Schema
• The star schema architecture is easy to design. It is called a star schema because the
diagram resembles a star, with points radiating from a center.
• The center of the star consists of the fact table, and the points of the star are
dimension tables.
2.Snowflake Schema
• The snowflake schema is an extension of the star schema. In a snowflake schema, each dimension is normalized and connected to additional dimension tables, as contrasted in the sketch below.
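The following sketch contrasts the two schemas for the same product dimension: in the star form the category attributes live in one de-normalized table, while in the snowflake form they are normalized into a separate dimension table. The DDL is an illustrative assumption, validated here with Python's sqlite3 module.

# Sketch of the same product dimension in star vs. snowflake form.
import sqlite3

# Star schema: one de-normalized dimension table.
star_schema = """
CREATE TABLE dim_product (
    product_key   INTEGER PRIMARY KEY,
    product_name  TEXT,
    category_name TEXT,
    category_dept TEXT
);
"""

# Snowflake schema: the category attributes are normalized out into a
# further dimension table referenced by dim_product.
snowflake_schema = """
CREATE TABLE dim_category (
    category_key  INTEGER PRIMARY KEY,
    category_name TEXT,
    category_dept TEXT
);
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT,
    category_key INTEGER REFERENCES dim_category(category_key)
);
"""

# Each variant is valid SQL; build each in its own in-memory database.
sqlite3.connect(":memory:").executescript(star_schema)
sqlite3.connect(":memory:").executescript(snowflake_schema)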
Advantages of Dimensional Data Modeling
• Simplified Data Access:
Dimensional data modeling enables users to easily access data through simple
queries, reducing the time and effort required to retrieve and analyze data.
• Enhanced Query Performance:
The simple structure of dimensional data modeling allows for faster query
performance, particularly when compared to relational data models.
• Increased Flexibility:
Dimensional data modeling allows for more flexible data analysis, as users can
quickly and easily explore relationships between data.
• Improved Data Quality:
Dimensional data modeling can improve data quality by reducing redundancy and
inconsistencies in the data.
• Easy to Understand:
Dimensional data modeling uses simple, intuitive structures that are easy to
understand, even for non-technical users.
Disadvantages of Dimensional Data Modeling
• Limited Complexity: Dimensional data modeling may not be
suitable for very complex data relationships, as it relies on simple
structures to organize data.
• Limited Integration: Dimensional data modeling may not integrate
well with other data models, particularly those that rely on
normalization techniques.
• Limited Scalability: Dimensional data modeling may not be as
scalable as other data modeling techniques, particularly for very
large datasets.
• Limited History Tracking: Dimensional data modeling may not be
able to track changes to historical data, as it typically focuses on
current data.
Bus Architecture
• In the context of data warehousing, the term "bus architecture"
typically refers to the concept of a "data bus."
• The data bus architecture is used to manage the flow of data from
source systems to the data warehouse
Here's how the data bus architecture works in a data warehouse
context:
• Central Integration Point: The data bus serves as a central integration
point where data from various source systems is collected and
transformed before being loaded into the data warehouse. It acts as a
staging area for the data used in the dimensional model.
Bus Architecture
• Hub-and-Spoke Model: The data bus architecture resembles a hub-
and-spoke model. The "hub" represents the central data integration
point (the data bus), and the "spokes" represent the source systems
that feed data into the hub.
• Data Staging: Data from source systems is first extracted and loaded
into the data bus. This allows for standardization, transformation, and
cleansing of the data before it's further integrated into the data
warehouse.
• Decoupling Data Sources: The data bus architecture decouples the
data warehouse from individual source systems. This means that
changes in source systems don't directly impact the data warehouse's
structure. Instead, changes are managed within the data bus, and the
data warehouse is updated with consistent, transformed data.
Bus Architecture
• Dimensional Modeling: Once the data is transformed within the data
bus, it is then loaded into the data warehouse using a dimensional
modeling approach, such as star schema or snowflake schema. This
allows for efficient querying and reporting.
• Scalability and Flexibility: The data bus architecture can
accommodate new source systems more easily, making the data
warehouse architecture scalable and flexible. New data sources can
be integrated into the data bus without disrupting the existing data
flow.
• Data Consistency: By applying transformations and data quality
checks within the data bus, data consistency and integrity are
maintained before data is loaded into the data warehouse. This helps
ensure accurate reporting and analysis.
Bus Architecture
• Incremental Loading: The data bus architecture supports incremental
loading of data. Only changed or new data needs to be processed and
loaded into the data warehouse, reducing the processing load and
improving efficiency. A minimal sketch of this idea follows below.
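The sketch below shows incremental ("delta") extraction driven by a high-water mark, assuming a last_modified column on the source table; the tables, column names, and values are illustrative only.

# Sketch of incremental loading: only rows changed since the last successful
# load are pulled from the source into the data bus (staging area).
import sqlite3

source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (order_id INTEGER, amount REAL, last_modified TEXT)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 100.0, "2024-01-01"), (2, 250.0, "2024-02-15"), (3, 80.0, "2024-03-01")],
)

last_load_watermark = "2024-02-01"   # stored from the previous run

# Extract only new/changed rows; the watermark would be advanced after a
# successful load into the warehouse.
delta = source.execute(
    "SELECT order_id, amount, last_modified FROM orders WHERE last_modified > ?",
    (last_load_watermark,),
).fetchall()
print(delta)   # [(2, 250.0, '2024-02-15'), (3, 80.0, '2024-03-01')]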
Hub Architecture
• A data hub architecture collects data and information from multiple
disparate sources to support specific consumer decisions.
• The data that is collected can reside anywhere and in any format. The hub
consumes the necessary data, helping to remove the data noise and
improve performance for decisions.
• Data is integrated and organized efficiently, effectively, and economically
to support functional business outcomes.
• Data hubs can consume data from various sources such as data lakes. The
data hub architecture that is built depends on understanding the consumers
of the data, the decisions that need to be made, and the data sources
themselves and how they relate to each other for the needs of the business;
in essence, business decision support across functional units.
Hub Architecture
• The Data Hub Architecture has to consider the value chain between
all functions and data transitions that occur between these functions,
including automated data decision activities found in artificial
intelligence (AI) and machine learning (ML) capabilities.
• Open data hubs help address different criteria for data access needed
by different people in the organization.
• Every function and role consumes data differently, and sometimes in the
same manner. An open data hub also helps with the use of hybrid-cloud
infrastructures, collaboration, and integration between various functional
teams.
Hub Architecture
• A data hub architecture diagram depicts a simple hub-and-spoke
relationship between the data and the data consumer.
• The architecture can contain multiple data hubs that are fit for purpose for
the consumers of the data.
• This helps with performance and overall understanding of how data is
used to make decisions within the organization.
• Inputs to the data hub can come from data warehouses, SharePoint, other
data silos, and anywhere data resides.
• The critical consideration when building a data hub is to be specific
about the data needed for the consumers' decisions.
• Otherwise, too much data to manage will decrease performance and
increase complexity.
Hub Architecture
• Data provides answers either in the form of the
data itself or by being transformed, together
with other data, into information and
knowledge for the consumer of the data.
• Data consumers can be people or
automated sources such as in machine
learning and artificial intelligence.
• In either case, the organization should
realize that applying a data hub is a
continuous improvement initiative for
supporting high-performing business
outcomes.
Spoke Architecture
• In the context of data warehousing, the "spoke architecture" is a concept
that complements the hub architecture described above.
• The spoke architecture is an extension of the hub architecture, and
together they form a comprehensive data integration and reporting
system.
• The spoke architecture is particularly relevant when dealing with specific
business areas or departments that have unique data requirements.
• In the hub-and-spoke architecture, the "hub" represents the centralized
data warehouse, and the "spokes" are individual data marts that cater to
the needs of specific business areas or user groups.
• Each spoke is a subset of the data warehouse, containing only the data
that is relevant to a particular department or function, as illustrated in
the sketch below.
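As a rough sketch of a spoke, the example below (using Python's sqlite3 module) exposes a department-specific data mart as a simple view over a central fact table. The names and the view-based approach are illustrative assumptions, not the only way to build a data mart.

# Sketch of a "spoke": a department-specific data mart as a subset of the
# central warehouse, exposed here as a SQL view.
import sqlite3

warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE fact_sales (region TEXT, department TEXT, revenue REAL)"
)
warehouse.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?)",
    [("North", "Sales", 500.0), ("South", "Sales", 300.0), ("North", "HR", 50.0)],
)

# The Sales spoke sees only the rows relevant to its business area.
warehouse.execute(
    "CREATE VIEW sales_mart AS "
    "SELECT region, revenue FROM fact_sales WHERE department = 'Sales'"
)
print(warehouse.execute("SELECT * FROM sales_mart").fetchall())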
ETL Process in Data Warehouse
ETL stands for Extract, Transform, Load. It is the process used
in data warehousing to extract data from various sources,
transform it into a format suitable for loading into a data
warehouse, and then load it into the warehouse.
The process of ETL can be broken down into the following three
stages:
• Extract data from legacy systems
• Cleanse the data to improve data quality and establish consistency
• Load data into a target database
ETL Process in Data Warehouse
1.Extract: The first stage in the ETL process is to extract data from
various sources such as transactional systems, spreadsheets, and flat
files. This step involves reading data from the source systems and
storing it in a staging area.

2.Transform: In this stage, the extracted data is transformed into a
format that is suitable for loading into the data warehouse. This may
involve cleaning and validating the data, converting data types,
combining data from multiple sources, and creating new data fields.

3.Load: After the data is transformed, it is loaded into the data
warehouse. This step involves creating the physical data structures
and loading the data into the warehouse.
ETL Process in Data Warehouse

The ETL process is an iterative process that is repeated as new data is added to the
warehouse. The process is important because it ensures that the data in the data
warehouse is accurate, complete, and up-to-date. It also helps to ensure that the data
is in the format required for data mining and reporting.
How ETL works
Extract
During data extraction, raw data is copied or exported from source locations
to a staging area. Data management teams can extract data from a variety of
data sources, which can be structured or unstructured. Those sources
include but are not limited to:
• SQL or NoSQL servers
• CRM and ERP systems
• Flat files
• Email
• Web pages
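A minimal sketch of the extract step is shown below: rows are copied from a relational source and a flat CSV file into a staging list, using only Python's standard library. The source names, columns, and staging layout are illustrative assumptions.

# Sketch of the extract step: copy raw rows from a SQL source and a flat
# file into a staging area, without transforming them yet.
import csv
import io
import sqlite3

# Source 1: a relational (SQL) system.
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (id INTEGER, name TEXT, country TEXT)")
crm.execute("INSERT INTO customers VALUES (1, 'Asha', 'U.S.A')")
sql_rows = crm.execute("SELECT id, name, country FROM customers").fetchall()

# Source 2: a flat file (CSV), simulated here with an in-memory string.
flat_file = io.StringIO("id,name,country\n2,Bilal,United States\n")
csv_rows = [(int(r["id"]), r["name"], r["country"]) for r in csv.DictReader(flat_file)]

# Staging area: raw, untransformed copies of the extracted data.
staging = sql_rows + csv_rows
print(staging)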
How ETL works
Transform
The second step of the ETL process is transformation. In this step, a set of
rules or functions is applied to the extracted data to convert it into a single,
standard format. It may involve the following processes/tasks:
• Filtering – loading only certain attributes into the data warehouse.
• Cleaning – filling up the NULL values with some default values, mapping
U.S.A, United States, and America into USA, etc.
• Joining – joining multiple attributes into one.
• Splitting – splitting a single attribute into multiple attributes.
• Sorting – sorting tuples on the basis of some attribute (generally key-
attribute).
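The sketch below applies several of these transformation tasks (cleaning NULLs, standardizing country names as in the example above, filtering, and sorting) to a small set of staged rows. The specific rules and column layout are illustrative assumptions.

# Sketch of the transform step applied to staged rows.
COUNTRY_MAP = {"U.S.A": "USA", "United States": "USA", "America": "USA"}

staging = [
    (1, "Asha", "U.S.A", 120.0),
    (2, "Bilal", "United States", None),
    (3, "Chen", "India", 75.0),
]

transformed = []
for cust_id, name, country, amount in staging:
    country = COUNTRY_MAP.get(country, country)     # cleaning: one standard format
    amount = amount if amount is not None else 0.0  # cleaning: default for NULLs
    if amount >= 0:                                 # filtering: keep valid rows only
        transformed.append((cust_id, name, country, amount))

transformed.sort(key=lambda row: row[0])            # sorting on the key attribute
print(transformed)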
How ETL works
Loading
• The third and final step of the ETL process is loading. In this step, the transformed
data is finally loaded into the data warehouse.
• Sometimes the data is loaded into the data warehouse very frequently, and
sometimes it is loaded at longer but regular intervals.
• The rate and period of loading solely depend on the requirements and vary from
system to system.
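A minimal sketch of the load step is shown below: the transformed rows are bulk-inserted into a warehouse table inside a single transaction, using Python's sqlite3 module. The table name and columns are illustrative assumptions.

# Sketch of the load step: bulk-insert transformed rows into the warehouse.
import sqlite3

warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE fact_customer_sales (cust_id INTEGER, name TEXT, country TEXT, amount REAL)"
)

transformed = [(1, "Asha", "USA", 120.0), (3, "Chen", "India", 75.0)]

with warehouse:   # transaction: commits on success, rolls back on error
    warehouse.executemany(
        "INSERT INTO fact_customer_sales VALUES (?, ?, ?, ?)", transformed
    )

print(warehouse.execute("SELECT COUNT(*) FROM fact_customer_sales").fetchone())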
ETL Tools
The most commonly used ETL tools are:
1. Hevo
2. Sybase
3. Oracle Warehouse builder
4. CloverETL
5. MarkLogic.
Advantages of the ETL process in data warehousing:

1.Improved data quality: The ETL process ensures that the data in the data
warehouse is accurate, complete, and up-to-date.
2.Better data integration: The ETL process helps to integrate data from
multiple sources and systems, making it more accessible and usable.
3.Increased data security: The ETL process can help to improve data
security by controlling access to the data warehouse and ensuring that
only authorized users can access the data.
4.Improved scalability: The ETL process can help to improve scalability by
providing a way to manage and analyze large amounts of data.
5.Increased automation: ETL tools and technologies can automate and
simplify the ETL process, reducing the time and effort required to load
and update data in the warehouse.
Disadvantages of ETL process in data warehousing:

1.High cost: The ETL process can be expensive to implement and maintain,
especially for organizations with limited resources.
2.Complexity: The ETL process can be complex and difficult to implement,
especially for organizations that lack the necessary expertise or
resources.
3.Limited flexibility: The ETL process can be limited in terms of flexibility,
as it may not be able to handle unstructured data or real-time data
streams.
4.Limited scalability: The ETL process can be limited in terms of scalability, as
it may not be able to handle very large amounts of data.
5.Data privacy concerns: The ETL process can raise concerns about data
privacy, as large amounts of data are collected, stored, and analyzed.
