DW Unit I Notes
UNIT I
INTRODUCTION TO DATA WAREHOUSE
A data warehouse is a centralized repository that stores data from various sources.
The primary purpose of a data warehouse is to enable efficient querying and analysis
of large volumes of data. Unlike traditional databases that are optimized for
transactional processing, data warehouses are designed for read-heavy operations,
such as complex queries and reports.
Key Features
Data Integration: Data warehouses integrate data from multiple, often disparate
sources such as operational databases, external data feeds, and legacy systems. This
integration involves data cleaning and transformation processes to ensure consistency
and quality.
Historical Data: Data warehouses are designed to store historical data, making it
possible to perform trend analysis and track changes over time. This is in contrast to
transactional databases, which typically store only current data.
Data Modeling: Data in a warehouse is often organized using multidimensional
models (e.g., star schema, snowflake schema) or data marts to facilitate complex
queries. These models support analytical operations and business intelligence
activities.
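As a rough illustration of the multidimensional modeling idea, the sketch below builds a tiny star schema in SQLite: one central fact table surrounded by denormalized dimension tables. All table and column names (sales_fact, dim_product, and so on) are invented for this example, not taken from the notes.

```python
# A minimal star-schema sketch using SQLite; names are illustrative only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT,
    category     TEXT      -- denormalized: category lives in the same table
);
CREATE TABLE dim_time (
    time_key INTEGER PRIMARY KEY,
    day      TEXT,
    month    TEXT,
    year     INTEGER
);
-- The fact table references every dimension and holds the measures.
CREATE TABLE sales_fact (
    product_key INTEGER REFERENCES dim_product(product_key),
    time_key    INTEGER REFERENCES dim_time(time_key),
    quantity    INTEGER,
    amount      REAL
);
""")
print("star schema created")
```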
OLAP (Online Analytical Processing): Data warehouses support OLAP, which
enables users to perform multidimensional analysis, such as slicing, dicing, and
drilling down into data to gain insights.
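To make the slice, dice, and drill-down operations concrete, here is a minimal sketch using pandas on a made-up sales table; the column names and figures are purely illustrative.

```python
# Illustrative OLAP-style operations with pandas on invented data.
import pandas as pd

sales = pd.DataFrame({
    "year":    [2023, 2023, 2024, 2024],
    "region":  ["East", "West", "East", "West"],
    "product": ["A", "B", "A", "B"],
    "amount":  [100, 150, 120, 180],
})

# Slice: fix a single dimension value (year == 2024).
slice_2024 = sales[sales["year"] == 2024]

# Dice: restrict several dimensions at once.
dice = sales[(sales["year"] == 2024) & (sales["region"] == "East")]
print(dice)

# Roll-up vs. drill-down: aggregate at coarser or finer levels of detail.
by_year = sales.groupby("year")["amount"].sum()                      # roll-up
by_year_region = sales.groupby(["year", "region"])["amount"].sum()   # drill-down
print(by_year_region)
```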
ETL Process: Data warehouses typically use ETL (Extract, Transform, Load)
processes to move data from source systems into the warehouse. This process
involves extracting data from source systems, transforming it into a suitable format,
and loading it into the data warehouse.
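The following toy pipeline sketches the three ETL phases on hypothetical data: extract rows from a CSV source, transform them (trim whitespace, normalize case, fill missing values), and load them into a SQLite table standing in for the warehouse.

```python
# A toy ETL sketch; the CSV contents and table schema are hypothetical.
import csv
import io
import sqlite3

raw_csv = io.StringIO("id,name,amount\n1, alice ,100\n2,BOB,\n")

# Extract: read raw rows from the source.
rows = list(csv.DictReader(raw_csv))

# Transform: trim whitespace, normalize case, fill missing amounts with 0.
cleaned = [
    (int(r["id"]), r["name"].strip().title(), float(r["amount"] or 0))
    for r in rows
]

# Load: insert the cleaned rows into the warehouse table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_fact (id INTEGER, name TEXT, amount REAL)")
conn.executemany("INSERT INTO customer_fact VALUES (?, ?, ?)", cleaned)
print(conn.execute("SELECT * FROM customer_fact").fetchall())
```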
Benefits
Improved Decision-Making: By consolidating data from various sources and
providing powerful analytical tools, data warehouses help organizations make
informed decisions based on comprehensive insights.
Enhanced Performance: Data warehouses are optimized for read-heavy operations,
allowing for fast query performance and efficient handling of large datasets.
Historical Analysis: Storing historical data enables trend analysis and forecasting,
providing valuable insights into past performance and future projections.
Data Consistency: The integration and transformation processes ensure that data is
consistent and accurate across the organization, reducing discrepancies and improving
reliability.
Use Cases/Applications
• Business Intelligence: Reporting, dashboards, and performance metrics.
• Data Mining: Discovering patterns and relationships in large datasets.
• Trend Analysis: Tracking and analyzing historical trends.
• Predictive Analytics: Using historical data to forecast future trends.
OLTP Systems vs. Data Warehouses
Data Characteristics and Performance
o Data Volume: An OLTP system handles a high volume of current, real-time data related to ongoing transactions; a data warehouse manages large volumes of historical data from various sources.
o Historical Data: An OLTP system typically stores only current data and recent transaction history, with limited historical retention; a data warehouse stores historical data over long periods, enabling trend analysis and historical comparisons.
o Performance: An OLTP system is optimized for quick, small transactions and high-speed processing of read and write operations; a data warehouse is optimized for complex queries, large-scale data retrieval, and aggregations.
o Querying: An OLTP system supports simple, real-time queries related to current business operations; a data warehouse supports complex, multidimensional queries and analytical processing.
Data Integration and ETL
o Integration: An OLTP system integrates with other operational systems for real-time data processing; a data warehouse integrates data from multiple sources, including operational databases, external sources, and flat files.
o ETL: An OLTP system is not typically involved in ETL processes, as it handles real-time transaction processing; a data warehouse employs ETL (Extract, Transform, Load) processes to extract data from source systems, transform it, and load it into the warehouse.
Two-Tier Architecture
The requirement for separation plays an essential role in defining the two-tier architecture for a data warehouse system. Although it is typically called a two-layer architecture to highlight the separation between physically available sources and the data warehouse, it in fact consists of four successive data flow stages:
1. Source layer: A data warehouse system uses heterogeneous sources of data. That data is initially stored in corporate relational databases or legacy databases, or it may come from information systems outside the corporate walls.
2. Data Staging: The data stored in the sources should be extracted, cleansed to remove inconsistencies and fill gaps, and integrated to merge heterogeneous sources into one standard schema. The so-named Extraction, Transformation, and Loading (ETL) tools can combine heterogeneous schemata and extract, transform, cleanse, validate, filter, and load source data into the data warehouse; a small sketch of this staging step follows this list.
3. Data Warehouse layer: Information is saved to one logically centralized repository: the data warehouse. The data warehouse can be accessed directly, but it can also be used as a source for creating data marts, which partially replicate data warehouse contents and are designed for specific enterprise departments. Metadata repositories store information on sources, access procedures, data staging, users, data mart schemas, and so on.
4. Analysis: In this layer, integrated data is efficiently and flexibly accessed to issue reports, dynamically analyze information, and simulate hypothetical business scenarios. It should feature aggregate information navigators, complex query optimizers, and user-friendly GUIs.
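As promised above, here is a minimal sketch of the data staging stage: two hypothetical source feeds with different field names are cleansed, gaps are filled, and both are merged into one standard schema. All record layouts and values are invented for illustration.

```python
# Staging sketch: cleanse and integrate two heterogeneous sources
# into one standard schema. Field names are hypothetical.
source_a = [{"cust_id": 1, "full_name": "Alice ", "city": "Pune"}]
source_b = [{"id": "2", "name": "bob", "location": None}]

def standardize_a(rec):
    return {"customer_id": rec["cust_id"],
            "name": rec["full_name"].strip().title(),
            "city": rec["city"] or "UNKNOWN"}

def standardize_b(rec):
    return {"customer_id": int(rec["id"]),           # convert types
            "name": rec["name"].strip().title(),     # normalize case
            "city": rec["location"] or "UNKNOWN"}    # fill gaps

staged = [standardize_a(r) for r in source_a] + \
         [standardize_b(r) for r in source_b]
print(staged)
```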
Three-Tier Architecture
The three-tier architecture consists of the source layer (containing multiple source systems), the reconciled layer, and the data warehouse layer (containing both data warehouses and data marts). The reconciled layer sits between the source data and the data warehouse.
The main advantage of the reconciled layer is that it creates a standard reference data model for the whole enterprise. At the same time, it separates the problems of source data extraction and integration from those of data warehouse population. In some cases, the reconciled layer is also used directly to better accomplish some operational tasks, such as producing daily reports that cannot be satisfactorily prepared using the corporate applications, or generating data flows to feed external processes periodically so as to benefit from cleaning and integration.
This architecture is especially useful for extensive, enterprise-wide systems. A disadvantage of this structure is the extra storage space consumed by the redundant reconciled layer. It also puts the analytical tools a little further away from being real-time.
Data Warehouses usually have a three-level (tier) architecture that includes:
1. Bottom Tier (Data Warehouse Server)
2. Middle Tier (OLAP Server)
3. Top Tier (Front end Tools).
The bottom tier consists of the data warehouse server, which is almost always an RDBMS. It may include several specialized data marts and a metadata repository. Data from operational databases and external sources (such as user profile data provided by external consultants) is extracted using application program interfaces called gateways. A gateway is provided by the underlying DBMS and allows client programs to generate SQL code to be executed at a server.
Examples of gateways include ODBC (Open Database Connectivity) and OLE DB (Object Linking and Embedding for Databases) by Microsoft, and JDBC (Java Database Connectivity).
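The gateway idea can be seen in miniature with Python's DB-API, which plays a role analogous to ODBC or JDBC: the client program generates SQL that the database server executes. In this sketch, SQLite stands in for the warehouse RDBMS.

```python
# Gateway idea in miniature: the client ships SQL to the server via a
# standard connection API. SQLite is a stand-in for a warehouse RDBMS.
import sqlite3

conn = sqlite3.connect(":memory:")            # the "gateway" connection
conn.execute("CREATE TABLE t (x INTEGER)")
conn.execute("INSERT INTO t VALUES (42)")
for row in conn.execute("SELECT x FROM t"):   # SQL executed at the server
    print(row)
```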
The middle tier consists of an OLAP server for fast querying of the data warehouse.
The OLAP server is implemented using either
(1) a relational OLAP (ROLAP) model, i.e., an extended relational DBMS that maps operations on multidimensional data to standard relational operations, or
(2) a multidimensional OLAP (MOLAP) model, i.e., a special-purpose server that directly implements multidimensional data and operations.
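A rough sketch of the ROLAP idea follows: a multidimensional aggregation (a cube cell and a roll-up) expressed as ordinary relational GROUP BY operations. Table and column names are illustrative.

```python
# ROLAP in miniature: multidimensional operations mapped to GROUP BY.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (region TEXT, year INTEGER, amount REAL);
INSERT INTO sales VALUES ('East', 2023, 100), ('East', 2024, 120),
                         ('West', 2023, 150), ('West', 2024, 180);
""")

# A (region, year) cube cell maps directly to a two-column GROUP BY.
for row in conn.execute(
        "SELECT region, year, SUM(amount) FROM sales GROUP BY region, year"):
    print(row)

# Rolling up to the region level just drops a grouping column.
for row in conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"):
    print(row)
```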
The top tier contains front-end tools for displaying results provided by OLAP, as well as additional tools for data mining of the OLAP-generated data.
The overall data warehouse architecture is shown in the figure.
The metadata repository stores information that defines DW objects. It includes the
following parameters and information for the middle and the top-tier applications:
1. A description of the DW structure, including the warehouse schema, dimensions, hierarchies, data mart locations, contents, etc.
2. Operational metadata, which usually describes the currency level of the stored data,
i.e., active, archived or purged, and warehouse monitoring information, i.e., usage
statistics, error reports, audit, etc.
3. System performance data, which includes indices used to improve data access and retrieval performance.
4. Information about the mapping from operational databases, which includes the source RDBMSs and their contents, cleaning and transformation rules, etc.
5. Summarization algorithms, predefined queries and reports, and business data, which includes business terms and definitions, ownership information, etc.
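As a loose illustration, one entry in such a metadata repository might record fields like those below, mirroring items 1 through 4 above; the field and table names are invented for this sketch, not a standard format.

```python
# A hypothetical metadata-repository entry for one warehouse table.
from dataclasses import dataclass, field

@dataclass
class TableMetadata:
    name: str
    schema: dict                  # DW structure: column -> type
    currency: str                 # operational metadata: active/archived/purged
    indexes: list = field(default_factory=list)  # performance data
    source_mapping: str = ""      # mapping from operational databases

meta = TableMetadata(
    name="sales_fact",
    schema={"product_key": "INTEGER", "amount": "REAL"},
    currency="active",
    indexes=["idx_sales_product"],
    source_mapping="oltp.orders -> sales_fact (cleaned, conformed keys)",
)
print(meta)
```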
Load Performance
Data warehouses require incremental loading of new data on a periodic basis, within narrow time windows; performance of the load process should be measured in hundreds of millions of rows and gigabytes per hour, and must not artificially constrain the volume of data the business requires.
Load Processing
Many steps must be taken to load new or updated data into the data warehouse, including data conversion, filtering, reformatting, indexing, and metadata updates.
Data Quality Management
Fact-based management demands the highest data quality. The warehouse must ensure local consistency, global consistency, and referential integrity despite "dirty" sources and massive database size.
Query Performance
Terabyte Scalability
Data warehouse sizes are growing at astonishing rates. Today they range from a few gigabytes to hundreds of gigabytes, with terabyte-sized data warehouses becoming common.
Snowflake Schema
A snowflake schema is a variant of the star schema. "A schema is known as a snowflake if one or more dimension tables do not connect directly to the fact table but must join through other dimension tables."
The snowflake schema is an expansion of the star schema in which each point of the star explodes into more points. It is called a snowflake schema because its diagram resembles a snowflake. Snowflaking is a method of normalizing the dimension tables in a star schema. When we normalize all the dimension tables entirely, the resulting structure resembles a snowflake with the fact table in the middle. Snowflaking is used to improve the performance of specific queries. The schema is diagrammed with each fact surrounded by its associated dimensions, and those dimensions related to further dimensions, branching out into a snowflake pattern.
The snowflake schema consists of one fact table linked to many dimension tables, which can in turn be linked to other dimension tables through many-to-one relationships. Tables in a snowflake schema are generally normalized to third normal form. Each dimension table represents exactly one level in a hierarchy.
The following diagram shows a snowflake schema with two dimensions, each having three levels. A snowflake schema can have any number of dimensions, and each dimension can have any number of levels.
Example: The figure shows a snowflake schema with a Sales fact table and Store, Location, Time, Product, Line, and Family dimension tables. The Market dimension has two dimension tables, with Store as the primary dimension table and Location as the outrigger dimension table. The Product dimension has three dimension tables, with Product as the primary dimension table and the Line and Family tables as the outrigger dimension tables.
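Below is a sketch of the snowflaked Product dimension from this example, with Product referencing Line and Line referencing Family as outrigger tables. The column names are assumptions, and the query shows the extra joins that snowflaking introduces.

```python
# Snowflaked Product dimension: Product -> Line -> Family outriggers.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE family  (family_key INTEGER PRIMARY KEY, family_name TEXT);
CREATE TABLE line    (line_key   INTEGER PRIMARY KEY, line_name TEXT,
                      family_key INTEGER REFERENCES family(family_key));
CREATE TABLE product (product_key INTEGER PRIMARY KEY, product_name TEXT,
                      line_key    INTEGER REFERENCES line(line_key));
CREATE TABLE sales_fact (product_key INTEGER REFERENCES product(product_key),
                         quantity INTEGER, price REAL);
""")

# Aggregating at the Family level requires joining through the outriggers.
query = """
SELECT f.family_name, SUM(s.quantity)
FROM sales_fact s
JOIN product p ON s.product_key = p.product_key
JOIN line    l ON p.line_key    = l.line_key
JOIN family  f ON l.family_key  = f.family_key
GROUP BY f.family_name
"""
print(conn.execute(query).fetchall())   # empty until rows are loaded
```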
A star schema stores all attributes for a dimension in one denormalized table, which requires more disk space than the more normalized snowflake schema. Snowflaking normalizes a dimension by moving attributes with low cardinality into separate dimension tables that relate to the core dimension table through foreign keys.
Snowflaking for the sole purpose of minimizing disk space is not recommended,
because it can adversely impact query performance.
In a snowflake schema, tables are normalized to remove redundancy; dimension tables are decomposed into multiple related dimension tables.
The figure shows a simple star schema for sales in a manufacturing company. The sales fact table includes quantity, price, and other relevant metrics. SALESREP, CUSTOMER, PRODUCT, and TIME are the dimension tables.
The star schema for sales, as shown above, contains only five tables, whereas the normalized (snowflaked) version extends to eleven tables. Notice that in the snowflake schema, the low-cardinality attributes in each original dimension table are moved out to form separate tables. These new tables are connected back to the original dimension table through artificial keys.
A snowflake schema is designed for flexible querying across more complex dimensions and relationships. It is suitable for many-to-many and one-to-many relationships between dimension levels.
Advantage of Snowflake Schema
1. The primary advantage of the snowflake schema is the improvement in query performance for some queries, due to minimized disk storage requirements and joins against smaller lookup tables.
2. It provides greater scalability in the interrelationship between dimension levels and
components.
3. No redundancy, so it is easier to maintain.
Disadvantage of Snowflake Schema
1. The primary disadvantage of the snowflake schema is the additional maintenance effort required due to the increased number of lookup tables.
2. Queries are more complex and hence more difficult to understand.
3. More tables mean more joins, and therefore longer query execution time.
Autonomous Data Warehouse vs. Snowflake
Autonomous Data Warehouse (ADW) and Snowflake are both leading cloud data warehousing solutions, but they have distinct features and architectures. Here’s a comparison of the two to help you understand their key differences and similarities:
1. Overview
• Autonomous Data Warehouse (ADW): This is Oracle's cloud-based data warehouse
solution that uses machine learning and automation to manage, optimize, and secure
the data warehouse environment. It is part of Oracle's broader suite of cloud services.
• Snowflake: Snowflake is a cloud-native data warehousing platform designed for
simplicity, scalability, and performance. It’s known for its unique architecture and
ability to handle structured and semi-structured data.
2. Architecture
• ADW:
o Architecture: Built on Oracle's database technology, ADW provides a traditional data
warehousing architecture with automated management features. It includes automatic
tuning, scaling, and security.
o Storage: It uses a single unified storage layer that is shared across multiple compute
instances.
o Compute and Storage Separation: ADW allows for separate scaling of compute and
storage, but it may require manual configuration for optimal performance.
• Snowflake:
o Architecture: Snowflake has a multi-cluster architecture with a clear separation
between compute, storage, and services layers. This allows for independent scaling of
compute and storage.
o Storage: Snowflake uses a central storage layer that is accessible by all compute
clusters.
o Compute and Storage Separation: Snowflake provides dynamic scaling of compute
resources and automatic data storage management, which is highly automated.
3. Automation and Management
• ADW:
o Automation: Provides extensive automation including self-tuning, automatic
backups, and self-healing capabilities.
o Management: Oracle’s ADW automates many administrative tasks, but some
configurations might still require user input, especially for more complex setups.
• Snowflake:
o Automation: Highly automated with features like automatic scaling, auto-suspend,
and auto-resume. Snowflake manages infrastructure and scaling seamlessly.
o Management: Offers minimal management overhead with automatic optimizations,
making it very user-friendly.
4. Performance and Scalability
• ADW:
o Performance: Performance tuning is automatic but based on Oracle’s traditional
database optimizations. It can be highly performant for complex queries due to its
underlying database technology.
o Scalability: Supports horizontal scaling, but you might need to manage compute
clusters and storage more directly compared to Snowflake.
• Snowflake:
o Performance: Known for high performance due to its architecture, which allows
multiple compute clusters to operate independently without contention.
o Scalability: Provides automatic scaling with virtually unlimited compute and storage
resources. Snowflake can handle varying workloads with ease due to its elastic
compute resources.
5. Data Integration and Support
• ADW:
o Data Integration: Integrates well with other Oracle Cloud services and tools, but
may require additional configuration for integration with non-Oracle tools.
o Support for Data Types: Supports a wide range of data types including structured
and semi-structured data.
• Snowflake:
o Data Integration: Known for its ease of integration with a variety of data sources
and tools, including third-party ETL and BI tools.
o Support for Data Types: Provides strong support for both structured and semi-
structured data (e.g., JSON, Avro, Parquet).
6. Security
• ADW:
o Security Features: Includes automated encryption, access controls, and auditing.
Oracle has a strong focus on security, which is a significant part of its offering.
o Compliance: Meets various industry compliance standards and provides
comprehensive security features.
• Snowflake:
o Security Features: Offers built-in encryption, access controls, and compliance
certifications. Snowflake provides a secure environment with automatic security
updates.
o Compliance: Adheres to major compliance and regulatory standards, including
GDPR, HIPAA, and more.
7. Cost Structure
• ADW:
o Pricing Model: Pricing can be based on the provisioned resources, including compute
and storage. Oracle offers a pay-as-you-go model and can also provide reserved
instances.
o Cost Management: Costs can be controlled through management of resources and
usage.
• Snowflake:
o Pricing Model: Uses a pay-per-usage model for compute and storage, with separate
charges for each. It also offers on-demand pricing and capacity-based pricing options.
o Cost Management: Provides features like auto-suspend and auto-resume to help
manage costs effectively.
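As a hedged sketch of these cost controls, the snippet below uses the snowflake-connector-python package to set AUTO_SUSPEND and AUTO_RESUME on a warehouse. The account, credentials, and the warehouse name analytics_wh are placeholders you would replace with your own.

```python
# Sketch: configure Snowflake auto-suspend/auto-resume cost controls.
# Requires the snowflake-connector-python package and real credentials.
import snowflake.connector

conn = snowflake.connector.connect(
    account="your_account",    # placeholder
    user="your_user",          # placeholder
    password="your_password",  # placeholder
)
cur = conn.cursor()
# Suspend the (placeholder) warehouse after 60 idle seconds; it resumes
# automatically on the next query.
cur.execute(
    "ALTER WAREHOUSE analytics_wh SET AUTO_SUSPEND = 60 AUTO_RESUME = TRUE"
)
cur.close()
conn.close()
```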
Summary
• Oracle Autonomous Data Warehouse is highly automated with a strong focus on
Oracle’s traditional database strengths and integrates well with Oracle’s ecosystem.
• Snowflake is designed to be a cloud-native solution with strong support for
scalability, ease of use, and integration across various platforms.
Modern Data Warehouse
A modern data warehouse is a cloud-based or hybrid solution designed to handle the demands of today's data-driven organizations. It emphasizes scalability, flexibility, and ease of use while integrating various data sources and enabling advanced analytics. Here’s an overview of what characterizes a modern data warehouse and some key considerations:
Characteristics of a Modern Data Warehouse
1. Cloud-Native Architecture
o Scalability: Easily scales compute and storage resources independently to handle
large and variable workloads.
o Elasticity: Automatically adjusts resources based on demand, ensuring efficient use
of resources and cost management.
o Managed Services: Typically offered as a fully managed service, reducing the need
for manual maintenance and administrative overhead.
2. Data Integration and Management
o Support for Diverse Data Types: Handles structured, semi-structured, and
unstructured data, including JSON, XML, and text files.
o Real-Time Data Processing: Capable of ingesting and processing data in real time
or near-real time to support dynamic analytics.
o Data Lake Integration: Integrates with data lakes to provide a unified view of data
across different storage formats and sources.
3. Advanced Analytics and Machine Learning
o Built-In Analytics: Provides tools for advanced analytics, including SQL-based
querying, data mining, and business intelligence.
o Machine Learning Integration: Often includes integration with machine learning
frameworks and tools for predictive analytics and data science.
4. Performance and Optimization
o Automatic Tuning: Uses algorithms to automatically optimize query performance,
resource allocation, and workload management.
o Concurrency Handling: Efficiently manages multiple users and queries concurrently
without performance degradation.
5. Security and Compliance
o Data Encryption: Ensures data is encrypted both at rest and in transit.
o Access Controls: Provides granular access control and user management features.
o Compliance: Adheres to various industry regulations and standards such as GDPR,
HIPAA, and PCI-DSS.
6. Cost Management
o Pay-as-You-Go: Uses a pay-per-use or consumption-based pricing model, allowing
organizations to pay only for the resources they use.
o Cost Optimization Tools: Includes features for monitoring and managing costs, such
as auto-suspend and auto-resume for compute resources.
Popular Modern Data Warehousing Solutions
1. Snowflake
o Architecture: Cloud-native with a multi-cluster architecture separating compute and
storage.
o Features: Offers automatic scaling, advanced data sharing capabilities, and strong
support for both structured and semi-structured data.
2. Amazon Redshift
o Architecture: Fully managed data warehouse service with a columnar storage format.
o Features: Provides high performance for complex queries, integrates with AWS
ecosystem, and supports both on-demand and reserved instances.
3. Google BigQuery
o Architecture: Serverless and highly scalable with a focus on big data analytics.
o Features: Offers real-time analytics, integration with Google Cloud services, and
automatic scaling.
4. Microsoft Azure Synapse Analytics (formerly SQL Data Warehouse)
o Architecture: Combines big data and data warehousing capabilities with a serverless
SQL pool and on-demand querying.
o Features: Integrates with Azure ecosystem, supports both on-demand and
provisioned query capabilities, and offers advanced analytics.
5. Oracle Autonomous Data Warehouse
o Architecture: Cloud-based with automated management features.
o Features: Offers automatic tuning, scaling, and security, and integrates well with
Oracle’s broader cloud services.
Key Considerations When Choosing a Modern Data Warehouse
1. Data Requirements: Assess whether you need support for structured, semi-
structured, or unstructured data and how real-time processing fits into your use case.
2. Integration Needs: Evaluate how well the data warehouse integrates with your
existing data sources, analytics tools, and business applications.
3. Performance and Scalability: Consider the performance requirements of your
workloads and the scalability options offered by the data warehouse.
4. Cost: Analyze the pricing model and ensure it aligns with your budget and usage
patterns. Look for features that help manage and optimize costs.
5. Security and Compliance: Ensure the data warehouse meets your security and
compliance requirements, including data encryption, access controls, and audit
capabilities.
Conclusion
A modern data warehouse provides a flexible, scalable, and cost-effective solution for
managing and analyzing large volumes of data. It leverages cloud technologies to
offer advanced features and integrations, making it easier for organizations to derive
actionable insights and support data-driven decision-making.