Module 6

Uploaded by

Bhavana
Emerging trends
Emergence of cloud services and infrastructure in Data Warehousing
• A cloud data warehouse is a modern way of storing and managing large amounts of data in a public cloud, with quick access to that data for analysis.
• This makes it well suited to data-driven businesses that need agility, flexibility, and ease of use from their infrastructure.
• Key features:
1. Separation of storage and compute
2. Data integration and management
3. Data storage
4. Database performance
5. Security and compliance
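The first key feature, separation of storage and compute, can be pictured with a toy model. This is only a sketch: `SharedStorage` and `ComputeCluster` are hypothetical names invented for illustration and do not correspond to any vendor's API.

```python
from dataclasses import dataclass, field


@dataclass
class SharedStorage:
    """Durable storage layer; its size is independent of compute."""
    rows: list = field(default_factory=list)


@dataclass
class ComputeCluster:
    """Stateless compute that reads from the shared storage layer."""
    storage: SharedStorage
    nodes: int = 1

    def resize(self, nodes):
        # Scaling compute changes only the node count; stored data is untouched.
        self.nodes = max(1, nodes)

    def rows_per_node(self):
        # Ceiling division: rows each node scans in one full pass over the data.
        return -(-len(self.storage.rows) // self.nodes)


storage = SharedStorage(rows=list(range(1000)))
etl = ComputeCluster(storage, nodes=2)   # one workload
bi = ComputeCluster(storage, nodes=4)    # another workload on the same data
etl.resize(8)                            # scale one cluster without moving data
```

Both clusters read the same stored rows, and resizing one of them neither copies data nor affects the other, which is the property the feature list refers to.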
Benefits
• Increased flexibility and scalability
• Reduced cost
• Enhanced security
• Increased performance
• Increased collaboration
• Parallel processing reduces the time required to manage data
• Dynamic allocation of computing resources reduces cost and improves performance
• Administration cost is limited because cloud service providers manage the backend systems
• The cloud acts as a failsafe, since providers build in disaster recovery
• Dynamic pricing plans make it affordable even for small teams

• With continuous innovation by data warehouse providers, the differences between platforms are widening. This has prompted businesses to compare cloud-based data warehouses to identify the best fit. The Data Warehouse as a Service market is expected to reach USD 7.69 billion by 2028.
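The parallel-processing benefit listed above can be illustrated with a small partition-and-combine sketch. It assumes threads from Python's standard library as a stand-in for warehouse compute nodes; the function names and sample data are hypothetical, and a real warehouse distributes work across machines rather than threads.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(rows, n_parts):
    """Split rows into n_parts slices, assigning rows round-robin."""
    return [rows[i::n_parts] for i in range(n_parts)]


def partial_sum(rows):
    """Work done independently on one partition."""
    return sum(amount for _, amount in rows)


def parallel_total(rows, n_workers=4):
    """Aggregate each partition on its own worker, then combine the results."""
    parts = partition(rows, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(partial_sum, parts))


sales = [(f"order-{i}", i * 10) for i in range(1, 101)]
total = parallel_total(sales)  # same result as a single sequential pass
```

The point of the sketch is the shape of the computation: each partition is aggregated without reference to the others, so adding workers shortens the scan while the combined answer stays the same.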
Evolution of cloud data warehouse
Traditional warehouse vs cloud warehouse

Data storage
  Traditional warehouse: Can handle only a limited amount of data, bounded by the systems and resources available at a given time.
  Cloud warehouse: Can handle virtually limitless data through parallel processing and near-unlimited scalability.
  Traditional warehouse: Semi-structured and unstructured data is difficult to handle with on-premises warehouses.
  Cloud warehouse: Tuned to handle semi-structured and unstructured data, which is automatically transformed for usability with 'schema-on-read'.

Interoperability
  Traditional warehouse: Interoperability between different technologies and orchestration of separate systems are challenging.
  Cloud warehouse: A virtual interoperability layer sits on top of the data sources to allow easy integration of data from different systems.

Scaling challenges
  Traditional warehouse: Scaling up is tedious and time-consuming, as both hardware and software must be reconfigured.
  Cloud warehouse: Instant scaling is possible on demand, both vertically and horizontally.
  Traditional warehouse: Scaling up requires large investments in hardware and human resources.
  Cloud warehouse: On-demand scaling allows companies to make affordable, incremental investments.
Infrastructure of cloud data warehouse
• A cloud warehouse is a database delivered as a managed service in a public cloud and built for scalable business intelligence.
• Cloud warehouses were built to address the needs of modern organizations. A major difference from traditional warehouses is the separation of compute and storage in the cloud, which makes the warehouse dynamic.
• On the storage side, traditional warehouses typically followed a star schema, which became costly at high data volumes and with a wider variety of data.
• Unlike traditional warehouses, cloud architecture includes shared storage that can be accessed in parallel, which improves scale and performance.
• Because resources are shared between users, enterprises pay only for what they use rather than for the whole infrastructure.
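For readers unfamiliar with the star schema mentioned above, the following is a minimal sketch using Python's built-in `sqlite3` module. The table and column names (`fact_sales`, `dim_product`, `dim_date`) and the sample rows are assumptions made for illustration, not taken from any particular warehouse.

```python
import sqlite3


def build_star_schema(conn):
    """Create one central fact table surrounded by two dimension tables."""
    conn.executescript("""
        CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
        CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER);
        CREATE TABLE fact_sales (
            product_id INTEGER REFERENCES dim_product(product_id),
            date_id INTEGER REFERENCES dim_date(date_id),
            amount REAL
        );
    """)
    conn.executemany("INSERT INTO dim_product VALUES (?, ?)",
                     [(1, "books"), (2, "games")])
    conn.executemany("INSERT INTO dim_date VALUES (?, ?)",
                     [(10, 2023), (11, 2024)])
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                     [(1, 10, 20.0), (1, 11, 30.0), (2, 11, 50.0)])


def sales_by_category_and_year(conn):
    """Typical star-schema query: join the fact table to its dimensions."""
    return conn.execute("""
        SELECT p.category, d.year, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_id = f.product_id
        JOIN dim_date AS d ON d.date_id = f.date_id
        GROUP BY p.category, d.year
        ORDER BY p.category, d.year
    """).fetchall()


conn = sqlite3.connect(":memory:")
build_star_schema(conn)
report = sales_by_category_and_year(conn)
```

Every new kind of data needs its own dimension table and load process up front, which is one reason this layout becomes harder to maintain as the variety of data grows.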
Business case 1:
A. Introduction:
• In today's highly competitive business landscape, organisations face the challenge of effectively managing and leveraging their ever-growing data assets.
• One of the major hurdles they encounter is the presence of data silos: isolated repositories of information scattered across different systems and departments.
• These silos impede collaboration, hinder data accessibility, and limit the ability to gain comprehensive insights.
• To address this issue, the construction of a new data warehouse emerges as a strategic solution that can centralise and integrate data, leading to improved decision-making, enhanced business insights, and increased operational efficiency.
B. Problem Statement:
• The presence of data silos within our organisation creates numerous complications and inefficiencies. Data resides in disparate systems, making it difficult to access and analyse holistically.
• This fragmentation hampers collaboration and leads to inconsistent and unreliable reporting. Valuable time and resources are wasted on manual data integration, resulting in delays in decision-making and missed opportunities.
• Furthermore, compliance and governance requirements become challenging to fulfil due to the lack of standardised data management practices across the organisation. To overcome these obstacles and unlock the full potential of our data, we propose the construction of a new data warehouse.
C. Objectives:
• The primary objectives of building a new data warehouse are as follows:
1. Centralising Data: Establish a single, unified repository to consolidate data from diverse sources and eliminate data silos.
2. Improving Data Accessibility: Provide a user-friendly interface that enables stakeholders across the organisation to easily access and retrieve relevant data.
3. Enhancing Data Quality and Consistency: Implement robust data governance practices, data cleansing, and validation mechanisms to ensure consistent and accurate information.
4. Enabling Advanced Analytics and Reporting: Create a foundation for advanced analytics, data modelling, and real-time reporting capabilities.
5. Facilitating Data Governance and Compliance: Establish data governance policies and procedures to ensure compliance with regulatory requirements and data privacy standards.
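Objectives 1 and 3 (centralising data, then cleansing and validating it) can be sketched in a few lines. The silo names, field names, and validation rule below are hypothetical and chosen only to show the shape of the process.

```python
def normalise(record, source):
    """Map a silo-specific record onto one shared layout."""
    return {
        "customer_id": str(record.get("id") or record.get("customer_id") or "").strip(),
        "email": (record.get("email") or "").strip().lower(),
        "source": source,
    }


def is_valid(row):
    """Minimal validation rule: an identifier and a plausible e-mail address."""
    return bool(row["customer_id"]) and "@" in row["email"]


def consolidate(silos):
    """Merge all silos into one deduplicated list plus a list of rejects."""
    warehouse, rejected = {}, []
    for source, records in silos.items():
        for record in records:
            row = normalise(record, source)
            if not is_valid(row):
                rejected.append(row)
            else:
                # Keep the first version seen for each customer.
                warehouse.setdefault(row["customer_id"], row)
    return list(warehouse.values()), rejected


silos = {
    "crm": [{"id": 1, "email": " Ann@Example.com "}],
    "billing": [{"customer_id": "1", "email": "ann@example.com"},
                {"customer_id": "2", "email": "not-an-address"}],
}
rows, rejected = consolidate(silos)
```

Here the same customer appears in both silos under different field names and is stored once, while the malformed billing record is set aside for review instead of entering the warehouse.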
Data lakes
• A data lake is a centralized repository that ingests and stores large volumes of data in its original form.
• The data can then be processed and used as a basis for a variety of analytic needs.
• Due to its open, scalable architecture, a data lake can accommodate all types of data from any source, from structured (database tables, Excel sheets) to semi-structured (XML files, webpages) to unstructured (images, audio files, tweets), all without sacrificing fidelity.
• The data files are typically stored in staged zones (raw, cleansed, and curated) so that different types of users may use the data in its various forms to meet their needs.
• Data lakes provide core data consistency across a variety of applications, powering big data analytics, machine learning, predictive analytics, and other forms of intelligent action.
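The staged zones described above (raw, cleansed, curated) can be sketched as three small functions. The event layout with `user` and `seconds` fields is a made-up example, and in practice each zone would be a storage location instead of an in-memory list.

```python
import json


def ingest_raw(payloads):
    """Raw zone: keep every payload exactly as it was received."""
    return list(payloads)


def cleanse(raw_zone):
    """Cleansed zone: parse each payload and drop malformed or incomplete events."""
    cleansed = []
    for payload in raw_zone:
        try:
            event = json.loads(payload)
            row = {"user": str(event["user"]), "seconds": float(event["seconds"])}
        except (ValueError, KeyError, TypeError):
            continue  # unusable here, but still available in the raw zone
        cleansed.append(row)
    return cleansed


def curate(cleansed_zone):
    """Curated zone: aggregate into a shape that is ready for reporting."""
    totals = {}
    for event in cleansed_zone:
        totals[event["user"]] = totals.get(event["user"], 0.0) + event["seconds"]
    return totals


raw = ingest_raw([
    '{"user": "ann", "seconds": 30}',
    "not json",
    '{"user": "ann", "seconds": "12.5"}',
    '{"user": "bob"}',
])
watch_time = curate(cleanse(raw))
```

Nothing is discarded from the raw zone, so a data scientist can revisit the rejected payloads later, while an analyst works only with the curated totals.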
Use cases
• Streaming media. Subscription-based streaming companies collect and process insights on customer behavior, which they may use to improve their recommendation algorithm.
• Finance. Investment firms use the most up-to-date market data, which is collected and stored in real time, to efficiently manage portfolio risks.
• Healthcare. Healthcare organizations rely on big data to improve the quality of care for patients. Hospitals use vast amounts of historical data to streamline patient pathways, resulting in better outcomes and reduced cost of care.
• Omnichannel retailer. Retailers use data lakes to capture and consolidate data that's coming in from multiple touchpoints, including mobile, social, chat, word-of-mouth, and in person.
• IoT. Hardware sensors generate enormous amounts of semi-structured to unstructured data on the surrounding physical world. Data lakes provide a central repository for this information to live in for future analysis.
• Digital supply chain. Data lakes help manufacturers consolidate disparate warehousing data, including EDI systems, XML, and JSON.
• Sales. Data scientists and sales engineers often build predictive models to help determine customer behavior and reduce overall churn.
Data lake vs data warehouse

Type
  Data lake: Structured, semi-structured, unstructured; relational, non-relational
  Data warehouse: Structured; relational

Schema
  Data lake: Schema on read
  Data warehouse: Schema on write

Format
  Data lake: Raw, unfiltered
  Data warehouse: Processed, vetted

Sources
  Data lake: Big data, IoT, social media, streaming data
  Data warehouse: Application, business, transactional data, batch reporting

Scalability
  Data lake: Easy to scale at a low cost
  Data warehouse: Difficult and expensive to scale

Users
  Data lake: Data scientists, data engineers
  Data warehouse: Data warehouse professionals, business analysts

Use cases
  Data lake: Machine learning, predictive analytics, real-time analytics
  Data warehouse: Core reporting, BI
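The "schema on read" and "schema on write" entries in the comparison above differ in when structure is enforced. The sketch below contrasts the two under a hypothetical two-field schema; the function names are illustrative only.

```python
SCHEMA = {"id": int, "amount": float}


def write_with_schema(table, record):
    """Schema on write: validate and coerce before storing, or reject."""
    if set(record) != set(SCHEMA):
        raise ValueError(f"record fields {sorted(record)} do not match the schema")
    table.append({name: cast(record[name]) for name, cast in SCHEMA.items()})


def read_with_schema(lake, schema):
    """Schema on read: store anything, apply structure only when querying."""
    for raw in lake:
        try:
            yield {name: cast(raw[name]) for name, cast in schema.items()}
        except (KeyError, ValueError, TypeError):
            continue  # this item does not fit this particular reader's schema


# Warehouse style: only conforming records are ever stored.
table = []
write_with_schema(table, {"id": "1", "amount": "9.5"})

# Lake style: everything is stored; each reader picks out what it can use.
lake = [{"id": "1", "amount": "9.5", "note": "gift"}, "free text", {"id": 2}]
usable = list(read_with_schema(lake, SCHEMA))
```

With schema on write, bad records fail at load time and the stored data is uniform. With schema on read, loading never fails, and a different reader could later apply a different schema to the same stored items.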
Managed Data Warehouses
• Managed cloud data warehouse services enable you to create a data environment that can adapt and evolve according to your data sources, changing business requirements, and overall long-term goals.
AWS Redshift vs Azure SQL Data Warehouse (since rebranded as Azure Synapse Analytics)

Use cases
  AWS Redshift: Data warehousing; large-scale data processing and analysis; business intelligence reporting; machine learning; ad hoc queries; ETL processing.
  Azure SQL Data Warehouse: Business intelligence and analytics; predictive modeling; customer profiling; fraud detection; supply chain optimization; financial analysis; healthcare analytics.

When not to use
  AWS Redshift: Amazon Redshift may not be suitable for applications that require real-time data processing or handling of unstructured data.
  Azure SQL Data Warehouse: Azure SQL Data Warehouse may not be the best tool for small to medium-sized businesses that do not have a significant amount of data to process. Additionally, businesses that do not require advanced analytics or machine learning may find that other tools are more cost-effective.

Type of data processing
  AWS Redshift: Amazon Redshift is optimized for handling large-scale data processing and analytics tasks. It can handle a variety of data types, including structured data from a variety of sources.
  Azure SQL Data Warehouse: Azure SQL Data Warehouse is designed for processing and analyzing large volumes of data. It can handle both structured and unstructured data, making it a versatile tool for data processing.
AWS Redshift vs Azure SQL Data Warehouse (continued)

Data ingestion
  AWS Redshift: Amazon Redshift can ingest data from a variety of sources, including data lakes, databases, and streaming data sources.
  Azure SQL Data Warehouse: The tool supports data ingestion from a variety of sources, including Azure Data Factory, Azure Stream Analytics, and other Azure services. It can also ingest data from on-premises data sources.

Data transformation
  AWS Redshift: Amazon Redshift provides built-in support for data transformation and cleansing operations, including data type conversion, aggregation, and filtering.
  Azure SQL Data Warehouse: Azure SQL Data Warehouse provides built-in support for data transformation using SQL Server Integration Services (SSIS). Users can also use Azure Data Factory or other tools for data transformation.

Machine learning support
  AWS Redshift: Amazon Redshift provides integration with various machine learning frameworks and tools, including SageMaker, TensorFlow, and more.
  Azure SQL Data Warehouse: The tool supports machine learning using Azure Machine Learning. Users can use machine learning models to analyze data and gain insights from it.

Query language
  AWS Redshift: Amazon Redshift supports standard SQL, as well as a variety of SQL extensions for handling large-scale data processing tasks.
  Azure SQL Data Warehouse: Azure SQL Data Warehouse uses SQL for querying data.

Deployment model
  AWS Redshift: Amazon Redshift is a fully managed service that can be deployed in a variety of configurations, including single-node and multi-node clusters.
  Azure SQL Data Warehouse: The tool is a cloud-based service and can be deployed on Microsoft Azure.

Integration with other services
  AWS Redshift: Amazon Redshift integrates with a variety of other AWS services, including S3, EMR, and more. It also supports integration with third-party tools and services.
  Azure SQL Data Warehouse: Azure SQL Data Warehouse integrates with other Azure services, including Azure Data Factory, Azure Stream Analytics, and Azure Machine Learning.
AWS Redshift vs Azure SQL Data Warehouse (continued)

Security
  AWS Redshift: Amazon Redshift provides robust security features, including encryption, access control, and compliance with various industry standards and regulations.
  Azure SQL Data Warehouse: The tool provides a range of security features, including data encryption, user authentication, and access controls.

Pricing model
  AWS Redshift: Amazon Redshift offers a variety of pricing models, including pay-as-you-go pricing, reserved instance pricing, and more.
  Azure SQL Data Warehouse: Azure SQL Data Warehouse uses a pay-as-you-go pricing model. Users are charged based on the amount of data processed and the amount of storage used.

Scalability
  AWS Redshift: Amazon Redshift is designed to be highly scalable, with support for scaling up and down based on workload requirements.
  Azure SQL Data Warehouse: Azure SQL Data Warehouse is highly scalable and can handle large amounts of data processing. Users can scale up or down as needed to meet their data processing requirements.

Performance
  AWS Redshift: Amazon Redshift provides high-performance data processing and analytics capabilities through its optimized query engine and support for distributed processing.
  Azure SQL Data Warehouse: The tool provides high performance for data processing and analysis.

Availability
  AWS Redshift: Amazon Redshift is designed to be highly available, with support for automatic failover and data replication across multiple availability zones.
  Azure SQL Data Warehouse: Azure SQL Data Warehouse is highly available, with built-in redundancy and failover capabilities.

Reliability
  AWS Redshift: Amazon Redshift provides built-in fault-tolerance and data recovery features, ensuring that data processing tasks are completed correctly and accurately.
  Azure SQL Data Warehouse: The tool is highly reliable, with built-in data replication and disaster recovery capabilities.
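The pricing row above describes charging by data processed and storage used. As a rough sketch of how such a bill could be estimated, the functions below use placeholder rates; the parameter names and numbers are hypothetical and are not actual AWS or Azure prices, and the flat reserved fee is a simplification of reserved-capacity pricing.

```python
def pay_as_you_go_cost(tb_processed, tb_stored, rate_processed, rate_stored):
    """Monthly bill when charged per TB processed and per TB stored."""
    return tb_processed * rate_processed + tb_stored * rate_stored


def break_even_tb_processed(reserved_monthly_fee, tb_stored, rate_processed, rate_stored):
    """TB processed per month above which a flat reserved fee becomes cheaper."""
    remaining = reserved_monthly_fee - tb_stored * rate_stored
    return max(0.0, remaining / rate_processed)


# Placeholder rates: 5.0 per TB processed, 20.0 per TB stored.
bill = pay_as_you_go_cost(tb_processed=10, tb_stored=4,
                          rate_processed=5.0, rate_stored=20.0)
threshold = break_even_tb_processed(reserved_monthly_fee=200.0, tb_stored=4,
                                    rate_processed=5.0, rate_stored=20.0)
```

With these placeholder numbers the usage-based bill is 130.0, and the flat fee of 200.0 only pays off once more than 24 TB are processed per month, which is why usage-based plans tend to suit small or irregular workloads.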
