Module 6
Emerging trends
Emergence of cloud services and infrastructure in Data Warehousing
A cloud data warehouse is a modern way of storing and managing large amounts of
data in a public cloud, giving you quick access to that data.
This makes it well suited to businesses that rely on data and require agility,
flexibility, and ease of use from their infrastructure.
Key features:
1. Separation of storage and compute
2. Data integration and management
3. Data storage
4. Database performance
5. Security and compliance
Benefits
Increased flexibility and scalability
Reduced cost
Enhanced security
Increased performance
Increased collaboration
Parallel processing reduces the time required to manage data
Dynamic allocation of computing resources reduces cost and improves performance
Cost of administration is limited with cloud service providers managing backend systems
The cloud acts as a failsafe, since providers typically build in disaster recovery
Dynamic pricing plans make it affordable even for small team operations
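The parallel-processing benefit listed above can be illustrated with a minimal sketch. It assumes a simple split, process, and combine pattern; the function names and the sample data are hypothetical, and Python threads stand in for the separate compute nodes a real warehouse would use, so the sketch shows the pattern rather than an actual speed-up.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(rows, n_parts):
    """Split rows into n_parts chunks (round-robin)."""
    return [rows[i::n_parts] for i in range(n_parts)]

def partial_sum(chunk):
    """Work done independently on one partition."""
    return sum(chunk)

def parallel_total(rows, n_workers=4):
    """Aggregate each partition in parallel, then combine the partial results."""
    chunks = partition(rows, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_sum, chunks))
    return sum(partials)

sales = list(range(1, 101))  # illustrative data only: the values 1..100
total = parallel_total(sales)
print(total)
```

Each worker sees only its own partition, which is why adding workers can shorten the time needed for a large scan.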
With continuous innovation by data warehouse providers, the distinction
between platforms is widening. This has prompted businesses to compare
cloud-based data warehouses to identify the best solution. The Data Warehouse as a
Service market is expected to reach USD 7.69 billion by 2028.
Evolution of cloud data warehouse
Traditional warehouse vs cloud warehouse
| | Traditional warehouse | Cloud warehouse |
| Data storage | Can handle only a limited amount of data, based on the availability of systems and resources at a time | Can handle virtually limitless data with parallel processing and infinite scalability |
| | Semi-structured and unstructured data is difficult to handle with on-prem warehouses | Tuned to handle unstructured data, which is automatically transformed for usability with 'schema-on-write' |
| Interoperability | The interoperability of different technologies and orchestration of separate systems is challenging | A virtual interoperable layer sits on the data source to allow easy integration of data from different systems |
| Scaling challenges | Scaling up is tedious and time-consuming, as both hardware and software must be reconfigured | Instant scaling is possible on demand, both vertically and horizontally |
| | Scaling up requires huge investments in hardware and human resources | On-demand scaling allows companies to make incremental investments that are affordable |
Infrastructure of cloud data warehouse
A cloud warehouse is a database delivered as a managed
service in a public cloud for scalable business intelligence.
Cloud warehouses were built to address the needs of modern organizations.
A major difference from traditional warehouses is the separation of compute
and storage in the cloud, which makes the warehouse dynamic.
Moreover, for storage, traditional warehouses followed a star schema,
which was costly, especially with high volumes and wider varieties of data.
Unlike traditional warehouses, cloud architecture includes a shared space that
can be accessed in parallel, and thus delivers improvements in scale and
performance.
Sharing resources between different users means enterprises pay only for
what they use rather than for the whole infrastructure.
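The pay-for-utilization point can be made concrete with a small arithmetic sketch. The usage profile and the price per node-hour below are hypothetical figures chosen only for illustration, and the two functions are simplified models, not any provider's actual pricing.

```python
def fixed_capacity_cost(peak_nodes, hours, rate_per_node_hour):
    """Traditional style: capacity sized for the peak is paid for every hour."""
    return peak_nodes * hours * rate_per_node_hour

def pay_per_use_cost(nodes_used_per_hour, rate_per_node_hour):
    """Cloud style: only the nodes actually used in each hour are paid for."""
    return sum(nodes_used_per_hour) * rate_per_node_hour

usage = [2, 2, 8, 8, 2, 2]  # hypothetical nodes needed in six consecutive hours
rate = 1.5                  # hypothetical price per node-hour

fixed = fixed_capacity_cost(max(usage), len(usage), rate)
on_demand = pay_per_use_cost(usage, rate)
print(fixed, on_demand)
```

With this profile the fixed-capacity model pays for 48 node-hours while the pay-per-use model pays for 24, because compute is scaled down outside the two peak hours.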
Business case 1:
A. Introduction:
In today's highly competitive business landscape, organisations face the challenge of
effectively managing and leveraging their ever-growing data assets.
One of the major hurdles they encounter is the presence of data silos—isolated
repositories of information scattered across different systems and departments.
These silos impede collaboration, hinder data accessibility, and limit the ability to gain
comprehensive insights.
To address this issue, the construction of a new data warehouse emerges as a
strategic solution that can centralise and integrate data, leading to improved decision-
making, enhanced business insights, and increased operational efficiency.
B. Problem Statement:
The presence of data silos within our organisation creates numerous
complications and inefficiencies. Data resides in disparate systems, making it
difficult to access and analyse holistically.
This fragmentation hampers collaboration and leads to inconsistent and
unreliable reporting. Valuable time and resources are wasted on manual data
integration, resulting in delays in decision-making and missed opportunities.
Furthermore, compliance and governance requirements become challenging to
fulfill due to the lack of standardised data management practices across the
organisation. To overcome these obstacles and unlock the full potential of our
data, we propose the construction of a new data warehouse.
C. Objectives:
The primary objectives of building a new data warehouse are as follows:
1. Centralising Data: Establish a single, unified repository to consolidate data
from diverse sources and eliminate data silos.
2. Improving Data Accessibility: Provide a user-friendly interface that enables
stakeholders across the organisation to easily access and retrieve relevant
data.
3. Enhancing Data Quality and Consistency: Implement robust data
governance practices, data cleansing, and validation mechanisms to ensure
consistent and accurate information.
4. Enabling Advanced Analytics and Reporting: Create a foundation for
advanced analytics, data modeling, and real-time reporting capabilities.
5. Facilitating Data Governance and Compliance: Establish data governance
policies and procedures to ensure compliance with regulatory requirements
and data privacy standards.
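Objectives 1 and 3 (centralising data and enforcing quality) can be sketched in a few lines. This is a minimal illustration under stated assumptions: the silo names, field names (`id`, `cust_no`, `email`), and validation rule are all hypothetical and are not taken from the business case.

```python
def normalise(record, source):
    """Map a silo-specific record onto one shared layout."""
    customer_id = record.get("id", record.get("cust_no", ""))
    return {
        "customer_id": str(customer_id).strip(),
        "email": record.get("email", "").strip().lower(),
        "source": source,
    }

def is_valid(row):
    """Hypothetical quality rule: an id is present and the email contains '@'."""
    return bool(row["customer_id"]) and "@" in row["email"]

def consolidate(silos):
    """Merge all silos into one list, dropping invalid rows and duplicate customers."""
    warehouse, seen = [], set()
    for source, records in silos.items():
        for record in records:
            row = normalise(record, source)
            if is_valid(row) and row["customer_id"] not in seen:
                seen.add(row["customer_id"])
                warehouse.append(row)
    return warehouse

# Two hypothetical silos that describe overlapping customers differently.
silos = {
    "crm": [{"id": "C1", "email": "Ann@Example.com "},
            {"id": "C2", "email": "not-an-email"}],
    "billing": [{"cust_no": "C1", "email": "ann@example.com"},
                {"cust_no": "C3", "email": "bob@example.com"}],
}
unified = consolidate(silos)
print(unified)
```

The duplicate customer and the invalid email are filtered out once, at load time, so every downstream report reads the same cleaned rows.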
Data lakes
A data lake is a centralized repository that ingests and stores large volumes
of data in its original form.
The data can then be processed and used as a basis for a variety of analytic needs.
Due to its open, scalable architecture, a data lake can accommodate all
types of data from any source, from structured (database tables, Excel
sheets) to semi-structured (XML files, webpages) to unstructured (images,
audio files, tweets), all without sacrificing fidelity.
The data files are typically stored in staged zones—raw, cleansed, and curated—so
that different types of users may use the data in its various forms to meet their needs.
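The staged zones can be sketched as a tiny pipeline. This is an illustrative sketch only: the sensor event layout (`sensor`, `temp_c`) and the function names are assumptions, and in-memory lists stand in for the files a real data lake would hold in each zone.

```python
import json

def to_cleansed(raw_lines):
    """Parse raw JSON lines, skipping malformed or incomplete events."""
    cleansed = []
    for line in raw_lines:
        try:
            event = json.loads(line)
            cleansed.append({"sensor": event["sensor"],
                             "temp_c": float(event["temp_c"])})
        except (ValueError, KeyError, TypeError):
            continue  # the original line still lives untouched in the raw zone
    return cleansed

def to_curated(cleansed):
    """Aggregate cleansed events into an average temperature per sensor."""
    totals = {}
    for event in cleansed:
        running_sum, count = totals.get(event["sensor"], (0.0, 0))
        totals[event["sensor"]] = (running_sum + event["temp_c"], count + 1)
    return {sensor: s / n for sensor, (s, n) in totals.items()}

# Raw zone: data kept in its original form, including a corrupted line.
raw_zone = [
    '{"sensor": "a", "temp_c": 20}',
    '{"sensor": "a", "temp_c": "22"}',
    'corrupted',
    '{"sensor": "b", "temp_c": 18.5}',
]
cleansed_zone = to_cleansed(raw_zone)
curated_zone = to_curated(cleansed_zone)
print(curated_zone)
```

The raw zone is never modified, so a data scientist can still inspect the corrupted line while an analyst works only with the curated averages.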
Data lakes provide core data consistency across a variety of applications, powering
big data analytics, machine learning, predictive analytics, and other forms of
intelligent action.
Use cases
Streaming media. Subscription-based streaming companies collect and process insights
on customer behavior, which they may use to improve their recommendation algorithm.
Finance. Investment firms use the most up-to-date market data, which is collected and
stored in real time, to efficiently manage portfolio risks.
Healthcare. Healthcare organizations rely on big data to improve the quality of care for
patients. Hospitals use vast amounts of historical data to streamline patient pathways,
resulting in better outcomes and reduced cost of care.
Omnichannel retailer. Retailers use data lakes to capture and consolidate data that's
coming in from multiple touchpoints, including mobile, social, chat, word-of-mouth, and in
person.
IoT. Hardware sensors generate enormous amounts of semi-structured to unstructured
data on the surrounding physical world. Data lakes provide a central repository for this
information to live in for future analysis.
Digital supply chain. Data lakes help manufacturers consolidate disparate warehousing
data, including data from EDI systems and XML and JSON files.
Sales. Data scientists and sales engineers often build predictive models to help
determine customer behavior and reduce overall churn.
Data lake vs data warehouse
| | Data lake | Data warehouse |
| Type | Structured, semi-structured, unstructured | Structured |
| Characteristics | AWS Redshift | Azure SQL Data Warehouse |
| Type of data processing | Amazon Redshift is optimized for handling large-scale data processing and analytics tasks. It can handle a variety of data types, including structured data from a variety of sources. | Azure SQL Data Warehouse is designed for processing and analyzing large volumes of data. It can handle both structured and unstructured data, making it a versatile tool for data processing. |