U2 - Hub Spoke

The document discusses hub-and-spoke and bus architectures in data warehousing, highlighting their roles in managing data flow and integration. It explains the ETL (Extract, Transform, Load) process, detailing its stages and the advantages and disadvantages of using ETL in data warehousing. Overall, it emphasizes the importance of these architectures and processes for efficient data management and analysis.


Big Data Analytics Unit 2

Hub and Spoke Architecture


● A hub-and-spoke data warehouse architecture is a type of data warehouse
architecture that is composed of a central hub, which is typically a relational
database, and a number of spokes, which are usually OLAP cubes or data
marts.
● The name comes from a bicycle wheel, in which spokes radiate outward from a central hub. In the logistics industry, a hub-and-spoke distribution model is used to move inventory from a large central distribution center out to multiple fulfillment centers.
● In the hub-and-spoke architecture, the hub serves as the centralized broker, while each spoke serves as an adapter that connects an application to the hub.
● A spoke establishes a connection with an application and converts the application's data into a format that the hub understands (see the sketch after this list).
● A hub-and-spoke data model organizes data in a way that is easy to understand and use; this type of data model is often used in databases and software applications.
● The model consists of a central hub surrounded by a number of spokes, where each spoke represents a different piece of data. Because the arrangement is simple and uniform, it is easy to see how the data is organized.
● This type of architecture is often used in organizations that have a large
amount of data to warehouse.
● The hub-and-spoke architecture allows the organization to keep the data in
one central location, while still providing access to the data for reporting and
analysis.
● In cloud deployments, hub-and-spoke topologies use virtual networks to manage external connectivity and to host services shared by multiple workloads. Workloads are hosted on their own (spoke) virtual networks and linked to the central hub through virtual network peering.
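
To make the broker/adapter idea concrete, here is a minimal Python sketch of spokes acting as adapters that convert application-specific records into a common format the hub understands. All class and field names (Hub, CrmSpoke, ErpSpoke, "customer_id", etc.) are hypothetical, invented for illustration; real integration hubs are far more involved.

# Minimal hub-and-spoke sketch: each spoke adapts one application's
# records into the hub's canonical format. All names are hypothetical.

class Hub:
    """Central broker that stores records in one canonical format."""
    def __init__(self):
        self.records = []

    def receive(self, record):
        # The hub only accepts records already in canonical form:
        # {"customer_id": ..., "amount": ...}
        self.records.append(record)

class CrmSpoke:
    """Adapter for a CRM system that uses its own field names."""
    def push(self, hub, crm_row):
        hub.receive({"customer_id": crm_row["ClientRef"],
                     "amount": float(crm_row["DealValue"])})

class ErpSpoke:
    """Adapter for an ERP system with a different native layout."""
    def push(self, hub, erp_row):
        hub.receive({"customer_id": erp_row["cust_no"],
                     "amount": erp_row["total_cents"] / 100.0})

hub = Hub()
CrmSpoke().push(hub, {"ClientRef": "C-17", "DealValue": "250.00"})
ErpSpoke().push(hub, {"cust_no": "C-17", "total_cents": 9900})
print(hub.records)  # both records now share the hub's canonical schema

The point of the design is that the hub never needs to know each application's native layout; adding a new application means writing one new spoke, not changing the hub.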
Bus Architecture
• In the context of data warehousing, the term "bus architecture" typically refers to the concept of a "data bus."
• The data bus architecture is used to manage the flow of data from source systems to the data warehouse.
Here's how the data bus architecture works in a data warehouse context:
• Central Integration Point: The data bus serves as a central integration point where data from various source systems is collected and transformed before being loaded into the data warehouse. It acts as a staging area for the data used in the dimensional model.
• Hub-and-Spoke Model: The data bus architecture resembles a hub-and-spoke
model. The "hub" represents the central data integration point (the data bus), and the
"spokes" represent the source systems that feed data into the hub.
• Data Staging: Data from source systems is first extracted and loaded into the data bus. This allows for standardization, transformation, and cleansing of the data before it is further integrated into the data warehouse.
• Decoupling Data Sources: The data bus architecture decouples the data warehouse from individual source systems. This means that changes in source systems do not directly impact the data warehouse's structure. Instead, changes are managed within the data bus, and the data warehouse is updated with consistent, transformed data.
• Dimensional Modeling: Once the data is transformed within the data bus, it is loaded into the data warehouse using a dimensional modeling approach, such as a star schema or snowflake schema. This allows for efficient querying and reporting.
• Scalability and Flexibility: The data bus architecture can accommodate new source systems easily, making the data warehouse architecture scalable and flexible. New data sources can be integrated into the data bus without disrupting the existing data flow.
• Data Consistency: By applying transformations and data quality checks within the data bus, data consistency and integrity are maintained before data is loaded into the data warehouse. This helps ensure accurate reporting and analysis.
• Incremental Loading: The data bus architecture supports incremental loading of data. Only changed or new data needs to be processed and loaded into the data warehouse, reducing the processing load and improving efficiency (a minimal sketch of this follows below).
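
Here is a minimal, hypothetical sketch of incremental loading in Python: only rows whose update timestamps are newer than the last successful load get processed. The row layout, column names, and the dictionary standing in for the warehouse are all invented for illustration.

# Hypothetical incremental-load sketch: process only rows changed since
# the last successful load.

from datetime import datetime

warehouse = {}          # toy stand-in for the target table, keyed by id
last_load_time = datetime(2024, 1, 1)

source_rows = [
    {"id": 1, "amount": 100, "updated_at": datetime(2023, 12, 30)},
    {"id": 2, "amount": 250, "updated_at": datetime(2024, 2, 5)},
    {"id": 3, "amount": 75,  "updated_at": datetime(2024, 3, 1)},
]

# Select only new or changed rows (updated after the last load).
delta = [r for r in source_rows if r["updated_at"] > last_load_time]

for row in delta:
    warehouse[row["id"]] = row   # upsert: insert new rows, overwrite changed ones

last_load_time = max(r["updated_at"] for r in delta)
print(f"Loaded {len(delta)} of {len(source_rows)} rows")  # Loaded 2 of 3 rows

Processing two rows instead of three matters little here, but the same selection logic is what keeps nightly loads tractable when the source holds millions of rows.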
ETL Process in Data Warehouse

ETL stands for Extract, Transform, Load. It is a process used in data warehousing to extract data from various sources, transform it into a format suitable for loading into a data warehouse, and then load it into the warehouse.
The ETL process can be broken down into the following three stages:
• Extract data from legacy and source systems
• Cleanse the data to improve data quality and establish consistency
• Load data into a target database

1. Extract: The first stage in the ETL process is to extract data from various sources such as transactional systems, spreadsheets, and flat files. This step involves reading data from the source systems and storing it in a staging area.

2. Transform: In this stage, the extracted data is transformed into a format that is suitable for loading into the data warehouse. This may involve cleaning and validating the data, converting data types, combining data from multiple sources, and creating new data fields.

3. Load: After the data is transformed, it is loaded into the data warehouse. This step involves creating the physical data structures and loading the data into the warehouse. (A compact end-to-end sketch of all three stages follows.)
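
As a rough illustration of the three stages chained together, here is a small, self-contained Python sketch. The CSV contents, field names, and the in-memory list standing in for the warehouse are all hypothetical; a real pipeline would target an actual database, typically driven by an ETL tool or scheduler.

# Toy end-to-end ETL sketch: extract from a CSV source, transform
# (standardize, default, validate), load into a target store.

import csv, io

raw_csv = "name,country,amount\nAlice,U.S.A,100\nBob,,not_a_number\n"

def extract(text):
    # Extract: read rows from a CSV source into a staging list.
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    # Transform: standardize country names, default missing values,
    # and drop rows whose amount fails validation.
    country_map = {"U.S.A": "USA", "United States": "USA", "America": "USA"}
    clean = []
    for r in rows:
        try:
            amount = float(r["amount"])
        except ValueError:
            continue  # reject rows that fail validation
        clean.append({"name": r["name"],
                      "country": country_map.get(r["country"], r["country"]) or "UNKNOWN",
                      "amount": amount})
    return clean

def load(rows, warehouse):
    # Load: append the transformed rows to the target store.
    warehouse.extend(rows)

warehouse = []
load(transform(extract(raw_csv)), warehouse)
print(warehouse)  # [{'name': 'Alice', 'country': 'USA', 'amount': 100.0}]

Note how Bob's row is rejected during transform, before it can pollute the warehouse; this is the data-quality guarantee the stages exist to provide.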

The ETL process is an iterative process that is repeated as new data is added to the warehouse. The process is important because it ensures that the data in the data warehouse is accurate, complete, and up-to-date. It also helps to ensure that the data is in the format required for data mining and reporting.
How ETL works

Extract
During data extraction, raw data is copied or exported from source locations to a staging area. Data management teams can extract data from a variety of data sources, which can be structured or unstructured. Those sources include, but are not limited to (a minimal extraction sketch follows this list):
• SQL or NoSQL servers
• CRM and ERP systems
• Flat files
• Email
• Web pages
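
As a small illustration of pulling from heterogeneous sources into one staging area, the sketch below reads from a SQL database (an in-memory SQLite table standing in for a SQL server) and from a flat file (an in-memory CSV string). The table, columns, and file contents are hypothetical.

# Hypothetical extraction sketch: copy raw rows from two source types
# (a SQL database and a flat file) into a common staging list.

import csv, io, sqlite3

# Source 1: a SQL server (stand-in: in-memory SQLite).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.execute("INSERT INTO orders VALUES (1, 99.5), (2, 150.0)")
sql_rows = [{"id": i, "amount": a}
            for i, a in conn.execute("SELECT id, amount FROM orders")]

# Source 2: a flat file (stand-in: an in-memory CSV string).
flat_file = "id,amount\n3,42.0\n"
file_rows = [{"id": int(r["id"]), "amount": float(r["amount"])}
             for r in csv.DictReader(io.StringIO(flat_file))]

# Staging area: raw rows from all sources, kept together for the
# transform step that follows.
staging = sql_rows + file_rows
print(len(staging))  # 3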

Transform
The second step of the ETL process is transformation. In this step, a set of rules or functions is applied to the extracted data to convert it into a single standard format. It may involve the following processes/tasks (a small sketch of these follows the list):
• Filtering – loading only certain attributes into the data warehouse.
• Cleaning – filling in NULL values with defaults, mapping U.S.A, United States, and America to USA, etc.
• Joining – joining multiple attributes into one.
• Splitting – splitting a single attribute into multiple attributes.
• Sorting – sorting tuples on the basis of some attribute (generally a key attribute).
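
The sketch below runs each of these five tasks, in order, over a couple of hypothetical rows; the field names and the country-mapping table are invented for illustration.

# Hypothetical transform sketch covering the five tasks above.

rows = [
    {"id": 2, "first": "Bob", "last": "Lee", "country": "U.S.A", "extra": "x"},
    {"id": 1, "first": "Amy", "last": None,  "country": "America", "extra": "y"},
]

country_map = {"U.S.A": "USA", "United States": "USA", "America": "USA"}

out = []
for r in rows:
    # Filtering: keep only the attributes the warehouse needs (drop "extra").
    r = {k: r[k] for k in ("id", "first", "last", "country")}
    # Cleaning: fill NULLs with a default and standardize country names.
    r["last"] = r["last"] or "UNKNOWN"
    r["country"] = country_map.get(r["country"], r["country"])
    # Joining: combine multiple attributes into one.
    r["full_name"] = f"{r['first']} {r['last']}"
    # Splitting: derive multiple attributes from one (here, initials).
    r["initials"] = r["first"][0] + r["last"][0]
    out.append(r)

# Sorting: order tuples by the key attribute.
out.sort(key=lambda r: r["id"])
print([r["full_name"] for r in out])  # ['Amy UNKNOWN', 'Bob Lee']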

Loading
• The third and final step of the ETL process is loading. In this step, the transformed data is finally loaded into the data warehouse.
• Sometimes the warehouse is refreshed very frequently; in other systems it is loaded at longer but regular intervals.
• The rate and period of loading depend solely on the requirements and vary from system to system. A minimal loading sketch follows.
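
To illustrate the load step (creating the physical data structure, then inserting the transformed rows), here is a short sketch against an in-memory SQLite database; the table name and rows are hypothetical.

# Hypothetical load sketch: create the target structure, then insert
# the transformed rows in one batch.

import sqlite3

transformed = [(1, "Amy", "USA"), (2, "Bob", "USA")]

conn = sqlite3.connect(":memory:")
# Create the physical data structure in the warehouse.
conn.execute("CREATE TABLE IF NOT EXISTS customer_dim "
             "(id INTEGER PRIMARY KEY, name TEXT, country TEXT)")
# Load the transformed data; executemany batches the inserts.
conn.executemany("INSERT INTO customer_dim VALUES (?, ?, ?)", transformed)
conn.commit()
print(conn.execute("SELECT COUNT(*) FROM customer_dim").fetchone()[0])  # 2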
ETL Tools
The most commonly used ETL tools are:
1. Hevo
2. Sybase
3. Oracle Warehouse Builder
4. CloverETL
5. MarkLogic
Advantages of the ETL process in data warehousing:

1. Improved data quality: The ETL process ensures that the data in the data warehouse is accurate, complete, and up-to-date.
2. Better data integration: The ETL process helps to integrate data from multiple sources and systems, making it more accessible and usable.
3. Increased data security: The ETL process can help to improve data security by controlling access to the data warehouse and ensuring that only authorized users can access the data.
4. Improved scalability: The ETL process can help to improve scalability by providing a way to manage and analyze large amounts of data.
5. Increased automation: ETL tools and technologies can automate and simplify the ETL process, reducing the time and effort required to load and update data in the warehouse.
Disadvantages of the ETL process in data warehousing:

1. High cost: The ETL process can be expensive to implement and maintain, especially for organizations with limited resources.
2. Complexity: The ETL process can be complex and difficult to implement, especially for organizations that lack the necessary expertise or resources.
3. Limited flexibility: The ETL process can be limited in terms of flexibility, as it may not be able to handle unstructured data or real-time data streams.
4. Limited scalability: The ETL process can be limited in terms of scalability, as it may not be able to handle very large amounts of data.
5. Data privacy concerns: The ETL process can raise concerns about data privacy, as large amounts of data are collected, stored, and analyzed.
