DWH Sessions 1-4

Cloud Data Warehousing

Instructor-in-Charge:
Sachin Arora, Guest faculty
BITS Pilani, Pilani Campus
Know Each Other

Name:
Total Experience:
Current Company:
Technologies familiar with:
Expectation:

BITS Pilani, Pilani Campus


Session Protocols

• Stay Focused
• Ask Questions at any time - keep it interactive
• When you are not speaking, keep your line muted to avoid background noise

BITS Pilani, Pilani Campus


Disclaimer
This content has been prepared from experience and various online sources to curate customized material focused on the topic.

BITS Pilani, Pilani Campus


Course Objective
1. Comprehend the significance of data warehousing in leveraging business
intelligence (BI) and its potential for driving informed decision-making.
2. Explore various schema designs, information delivery methods, and
suitable architectures tailored for effective data warehousing.
3. Gain insight into the essential processes, management strategies, and
infrastructure required to construct a robust and functional data
warehouse.
4. Analyze the transformative journey of data warehouses within the
context of big data and cloud environments, recognizing their impact and
evolving role in contemporary data management.

BITS Pilani, Pilani Campus


Pre-Requisites
1. Basic Understanding of Databases: Familiarity with databases and their concepts, such
as tables, rows, and relationships, is crucial.
2. SQL Proficiency: A strong grasp of SQL (Structured Query Language) is essential as it is
the language used to interact with databases, which is a fundamental part of data
warehousing.
3. Foundational Knowledge in Data Analysis: Understanding basic data analysis concepts
and techniques, such as data visualization, aggregations, and basic statistical analysis, will
be helpful.
4. Awareness of Business Intelligence Concepts: Basic knowledge of business intelligence
and how data is used for decision-making and reporting.
5. Data Modeling Understanding: Knowledge of data modeling concepts, including entity-
relationship diagrams and understanding how data is structured and related.

BITS Pilani, Pilani Campus


Evaluation Plan

Name                Type                    Weight
Quiz - 2            Objective               10%
Assignment 1        Individual Assignment   20%
Mid-term Exam       Closed book             30%
End Semester Exam   Open book               40%

BITS Pilani, Pilani Campus


Text Books
T1  Ponniah P, “Data Warehousing Fundamentals”, Wiley Student Edition, 2012
T2  Kimball R, “The Data Warehouse Toolkit”, 3e, John Wiley, 2013

BITS Pilani, Pilani Campus


Reference Books
R1  Anahory S & Dennis M, “Data Warehousing in the Real World”, Pearson Education, 2008
R2  Kimball R, Reeves L, Ross M & Thornthwaite W, “The Data Warehouse Lifecycle Toolkit”, 2e, John Wiley, 2012
R3  Jiawei Han, Micheline Kamber and Jian Pei, “Data Mining: Concepts and Techniques”, Morgan Kaufmann Publishers, 2012
R4  Amazon Redshift Database Developer Guide, https://docs.aws.amazon.com/redshift - online resource for the DW on the Cloud
R5  Krish Krishnan, “Data Warehousing in the Age of Big Data”, Morgan Kaufmann Publishers, 2013
R6  William H Inmon, et al., “DW 2.0: The Architecture for the Next Generation of Data Warehousing”, Morgan Kaufmann, 2012
R7  Joe Kraynak & David Baum, “Cloud Data Warehousing for Dummies”
R8  Steve Swoy, “Automating the Modern Data Warehouse: A Comprehensive Guide for Optimal Data Management”

BITS Pilani, Pilani Campus


Module 1

BITS Pilani, Pilani Campus


Module 1 – Introduction to Data Warehousing

1.1 Evolution of Data Warehousing


1.2 Data Warehousing Definition
1.3 Business Need for Data Warehouse
1.4 Comparison of Data Warehouse with other business software

BITS Pilani, Pilani Campus


Evolution of Data Warehouse (DW)

RDBMS (Oracle, SQL Server, MySQL, PostgreSQL)
→ Appliances (Teradata, Netezza, Exadata, Vertica, SAP HANA)
→ Hadoop (Hive, HBase, Spark)
→ Cloud (BigQuery, Redshift, Synapse, Snowflake)

BITS Pilani, Pilani Campus


What is Data Warehouse (DW)?

BITS Pilani, Pilani Campus


Understanding Data Warehouse - Library Analogy
(Analogy diagram: a library holds multiple types of books, organized by a librarian and used by library users.)

BITS Pilani, Pilani Campus


Data Warehouse (DW)

(Diagram: Data Source 1, Data Source 2, and Data Source 3 feed tools that read data from the sources and load it into the Data Warehouse, which serves Visualization and Ad-Hoc Reporting.)

A data warehouse is a system for storing and managing data from multiple sources in a single, central location. It is designed to support complex analytics and decision-making.

BITS Pilani, Pilani Campus


Characteristics of Data Warehouse (DW)

1. Subject-Oriented   2. Integrated   3. Time-Variant   4. Non-Volatile

BITS Pilani, Pilani Campus


Characteristics of Data Warehouse (DW)
Subject-Oriented
A data warehouse is organized around specific subjects or areas of interest, such as sales, customers, or products. This subject orientation allows data to be organized and analyzed in a way that is relevant to business users.
Integrated
A data warehouse integrates data from various sources. These sources may include cloud applications, relational databases, and structured and semi-structured data. The data from these sources is integrated in a consistent manner so that it is relatable and ideally certifiable, giving the business confidence in the data’s quality.
Time-Variant
An organization keeps historical data over time in a data warehouse. This makes it possible to spot patterns and trends and, based on what has previously occurred, to make better decisions for the future.
Non-Volatile
A data warehouse is non-volatile: once data is loaded into the warehouse, it is not modified or deleted. This helps to ensure the accuracy and consistency of the data and maintains a historical record of changes over time.

BITS Pilani, Pilani Campus


Business need of DW – A Real Life Example

Prime Tyre Manufacturing (a fictitious company) runs separate HR, ERP, and CRM software systems.

BITS Pilani, Pilani Campus


Business need of DW

(Diagram: the company’s source systems feed a single Data Warehouse, which serves Visualization and Ad-Hoc Reporting.)

BITS Pilani, Pilani Campus


Few Terminologies & Comparisons

BITS Pilani, Pilani Campus


Data Modeling
Definition - Data modeling is the process of creating a visual representation of
either a whole information system or parts of it to communicate connections
between data points and structures.

Purpose: The goal of data modeling is to illustrate the types of data used and stored within the system, the relationships among these data types, the ways the data can be grouped and organized, and its formats and attributes.

BITS Pilani, Pilani Campus


Data Modeling – An Example
Customer: Customer_ID (PK), Name, Address, City, State, PinCode, Support_RepID (FK)
Invoice: Invoice_ID (PK), Customer_ID (FK), Invoice_Date, Billing_Address, Billing_City, Billing_State, Invoice_Amount
InvoiceLine: Inv_Line_ID (PK), Invoice_ID (FK), Ship_ID (FK), Unit_Price, Quantity
Ship: Ship_ID (PK), Ship_Dly_date, Shipping_Address
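As an illustration only (not part of the original slide), the same model could be declared in SQL roughly as follows; the table and column names follow the diagram above, while the data types are assumed.

-- Assumed data types; names taken from the diagram above.
CREATE TABLE Customer (
    Customer_ID   INT PRIMARY KEY,
    Name          VARCHAR(100),
    Address       VARCHAR(200),
    City          VARCHAR(50),
    State         VARCHAR(50),
    PinCode       VARCHAR(10),
    Support_RepID INT            -- FK to a support-rep table not shown in the diagram
);

CREATE TABLE Invoice (
    Invoice_ID      INT PRIMARY KEY,
    Customer_ID     INT REFERENCES Customer(Customer_ID),
    Invoice_Date    DATE,
    Billing_Address VARCHAR(200),
    Billing_City    VARCHAR(50),
    Billing_State   VARCHAR(50),
    Invoice_Amount  DECIMAL(10,2)
);

CREATE TABLE Ship (
    Ship_ID          INT PRIMARY KEY,
    Ship_Dly_date    DATE,
    Shipping_Address VARCHAR(200)
);

CREATE TABLE InvoiceLine (
    Inv_Line_ID INT PRIMARY KEY,
    Invoice_ID  INT REFERENCES Invoice(Invoice_ID),
    Ship_ID     INT REFERENCES Ship(Ship_ID),
    Unit_Price  DECIMAL(10,2),
    Quantity    INT
);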

BITS Pilani, Pilani Campus


Data Mining

Definition - Data mining is the process of using statistical analysis and machine
learning to discover hidden patterns, correlations, and anomalies within large
datasets.

Purpose: This information can aid you in decision-making, predictive modeling,


and understanding complex phenomena.

BITS Pilani, Pilani Campus


Data Mining - Process

Collection → Understanding → Preparation → Modelling → Evaluation

BITS Pilani, Pilani Campus


Data Mining - Process
•Step 1: Collection – First, data is collected, organized, and loaded into a data warehouse. The data is stored and managed either in the cloud or on in-house servers.
•Step 2: Understanding – In this step, data scientists and business analysts examine the properties of the data and conduct an in-depth analysis in the context of a particular problem statement defined by the company. This is addressed using querying, visualization, and reporting.
•Step 3: Preparation – Once the data sources of the available data are confirmed, the data is cleaned, constructed, and formatted into the required form. In this process, additional data can also be explored at greater depth, informed by the insights uncovered in the previous stage.
•Step 4: Modeling – In this stage, for the prepared dataset, modeling techniques are
selected. A data model is just like a diagram that reflects and describes the relationships
between different types of information that are stored in the database. Common techniques
include decision trees, regression, clustering, classification, association rule mining, and
neural networks.
•Step 5: Evaluation – In the context of the business objectives, the model results are
evaluated. In this phase, due to new patterns that are discovered in the model results or
other factors, new business requirements may be raised.
BITS Pilani, Pilani Campus
Data Warehouse Vs. Data Mining
Aspect | Data Warehousing | Data Mining
Purpose | To store and manage data for analysis, reporting, and decision making | To discover patterns, trends, and insights within data that may not be immediately apparent
Function | Storage and organization of structured data | Analysis and extraction of hidden patterns from data
Objective | Provides a unified view of an organization's data for analysis | Discovers hidden patterns or relationships within the data
Techniques | Structuring, storage, and optimization of data | Statistical analysis, machine learning, AI algorithms
Usage | Supports reporting, querying, and analytics | Identifies trends, predictions, and decision support
Benefit | Provides a centralized data repository for business intelligence | Uncovers insights for making informed decisions
Examples | Data warehouses like Oracle, SQL Server | Techniques like clustering, regression, classification

BITS Pilani, Pilani Campus


OLTP Vs. OLAP

(Diagram: OLTP systems handle Insert, Update, and Delete transactions; OLAP systems organize the same data for analysis, typically as a multidimensional cube.)
Let’s say your data set has figures for various products entered, including customers and time of
purchase. In cube form, for example, the X-axis in the coordinate system represents the time
dimension, the Z-axis represents the customer dimension, and the Y-axis represents your product
dimension. The size of the cube is therefore determined by the existing data and is therefore dynamic
from the ground up. Each combination of dimension data is assigned values, each of which represents
a coordinate in the cube.
Here’s an example: in March 2020, customer C purchased product B. The results of this relationship are quantifiable values, such as the sales revenue or the quantity purchased, which are stored, so to speak, at the coordinate (data point) that can now be exactly determined.
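To make the cube idea concrete, here is a minimal sketch of how such dimension combinations can be pre-aggregated in SQL; the sales table and its column names are assumed for illustration, and GROUP BY CUBE is supported by most analytical SQL engines.

-- Hypothetical sales table with time, customer, and product dimensions.
SELECT
    purchase_month,                      -- time dimension (X-axis in the description above)
    customer_id,                         -- customer dimension (Z-axis)
    product_id,                          -- product dimension (Y-axis)
    SUM(sales_revenue) AS total_revenue, -- quantifiable value stored at each data point
    SUM(quantity)      AS total_quantity
FROM sales
GROUP BY CUBE (purchase_month, customer_id, product_id)
ORDER BY purchase_month, customer_id, product_id;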
BITS Pilani, Pilani Campus
OLTP Vs. OLAP
OLAP | Operational Database (OLTP)
It involves historical processing of information. | It involves day-to-day processing.
OLAP systems are used by knowledge workers such as executives, managers, and analysts. | OLTP systems are used by clerks, DBAs, or database professionals.
It is used to analyze the business. | It is used to run the business.
It focuses on information out. | It focuses on data in.
It is based on Star Schema, Snowflake Schema, and Fact Constellation Schema. | It is based on the Entity Relationship Model.
It contains historical data. | It contains current data.
It provides summarized and consolidated data. | It provides primitive and highly detailed data.
It provides a summarized and multidimensional view of data. | It provides a detailed and flat relational view of data.
The number of users is in hundreds. | The number of users is in thousands.
The number of records accessed is in millions. | The number of records accessed is in tens.

BITS Pilani, Pilani Campus


Integrating Heterogeneous Databases

BITS Pilani, Pilani Campus


Heterogeneous Databases
Heterogeneous databases are databases that consist of data from multiple,
dissimilar sources. These sources may include different types of databases, such
as relational databases, NoSQL databases, and flat files.

BITS Pilani, Pilani Campus


Integrating Heterogeneous Databases

• Integration of heterogeneous databases in data warehousing refers to the process of


combining data from multiple, disparate databases into a central repository, known as
a data warehouse. This process involves extracting data from different sources, such as
relational databases, NoSQL databases, and flat files, and then transforming, cleaning,
and loading the data into the data warehouse. This is commonly called ETL or ELT in
Data Engineering World.

BITS Pilani, Pilani Campus


Approaches for Integration of Heterogeneous Databases
There are two different approaches to integrating heterogeneous databases:

1. Query Driven Approach: This is the traditional approach to integrate


heterogeneous databases. The integration of data occurs dynamically at the time
of the query execution rather than through a pre-defined, static integration
process.

2. Update Driven Approach: The information from multiple heterogeneous sources


are integrated in advance and are stored in a warehouse. This information is
available for direct querying and analysis.

BITS Pilani, Pilani Campus


Query Driven Vs Update Driven Approach

Query Driven Approach:
• When a query is issued on the client side, a metadata dictionary translates the query into an appropriate form for the individual heterogeneous sites involved.
• These queries are then mapped and sent to the local query processors.
• The results from the heterogeneous sites are integrated into a global answer set.

Update Driven Approach:
• Information from multiple heterogeneous sources is integrated in advance and stored in a warehouse.
• This information is available for direct querying and analysis.

BITS Pilani, Pilani Campus


Query Driven Vs Update Driven Approach – Pros & Cons
Query Driven – Advantages:
• Data does not need to be stored in a dedicated DW.
Query Driven – Disadvantages:
• The query-driven approach needs complex integration and filtering processes.
• This approach is very inefficient.
• It is very expensive for frequent queries.
• This approach is also very expensive for queries that require aggregations.

Update Driven – Advantages:
• This approach provides high performance.
• The data is copied, processed, integrated, annotated, summarized, and restructured in the semantic data store in advance.
• Query processing does not require an interface to process data at local sources.
Update Driven – Disadvantages:
• Need to have a DW with ample storage to store the data.

BITS Pilani, Pilani Campus


Query Driven Vs Update Driven Approach – Summary
Characteristic | Query-Driven Approach | Update-Driven Approach
Data Integration Process | Data is integrated at the time of query execution. | Data is integrated in advance and stored in the warehouse.
Data Freshness | Real-time or near real-time data availability. | Periodic updates with a potential latency between updates.
Query Performance | Can be impacted by the need to integrate data in real-time. | Generally faster querying as data is pre-integrated.
Complexity of Implementation | Complex, especially for real-time integration requirements. | Easier to implement and maintain as integration is scheduled.
Consistency and Standardization | May result in varied data formats and structures. | Provides a consistent and standardized view of integrated data.
Resource Utilization | Resources are used at the time of query execution. | Resources are used during scheduled update processes.
Use Case Suitability | Suitable for scenarios requiring real-time data access. | Suitable for scenarios where periodic updates are acceptable.
Examples | Real-time analytics, dashboards, operational reporting. | Data warehousing with nightly or weekly batch updates.

BITS Pilani, Pilani Campus


Knowledge Check

Q1. Which of the following is a primary goal of a Data Warehouse?


A. Real-time data processing
B. Storing raw data without transformation
C. Supporting online transaction processing
D. Providing a centralized and unified view of data for analysis
Answer: D. Providing a centralized and unified view of data for analysis

Q2. Which of the following is NOT a characteristic of a Data Warehouse?


A. Subject-oriented
B. Integrated
C. Online Transaction Processing (OLTP)
D. Time-variant
Answer: C. Online Transaction Processing (OLTP)

BITS Pilani, Pilani Campus


Knowledge Check

Q3. What is the role of metadata in a Data Warehouse?


A. Storing primary data
B. Managing data security
C. Providing information about data characteristics and relationships
D. Executing data transformations
Answer: C. Providing information about data characteristics and relationship

Q4. How does a Data Warehouse contribute to data consistency and accuracy?
A. By storing raw data without transformation
B. By providing real-time data updates
C. By integrating data from various sources into a unified view
D. By limiting access to historical data
Answer: C. By integrating data from various sources into a unified view

BITS Pilani, Pilani Campus


Knowledge Check

Q6. Why is time-variant data important in a Data Warehouse?


A. Time-variant data is not relevant in Data Warehousing.
B. Time-variant data allows for the analysis of historical trends and changes over time.
C. Time-variant data focuses on real-time data processing.
D. Time-variant data increases data complexity.
Answer: B. Time-variant data allows for the analysis of historical trends and changes over time.

Q7. What is the significance of business intelligence tools in the context of Data Warehousing?
A. Business intelligence tools are not applicable to Data Warehousing.
B. Business intelligence tools help in real-time data processing.
C. Business intelligence tools enable the analysis and visualization of data stored in the Data Warehouse.
D. Business intelligence tools are only used for data extraction.
Answer: C. Business intelligence tools enable the analysis and visualization of data stored in the Data Warehouse

BITS Pilani, Pilani Campus


Thank you

BITS Pilani, Pilani Campus


Data Warehouse - Architecture

Instructor-in-Charge:
Sachin Arora, Guest faculty
BITS Pilani, Pilani Campus
Module 2

BITS Pilani, Pilani Campus


Module 2 – Data Warehouse Architecture

2.1 Data Mart Vs Data Warehouse Vs Operational Data Store


2.2 Data Warehousing Architecture Types
2.3 Top-Down Vs Bottom-up approach to Data Warehousing
2.4 Snowflake Cloud Data Warehouse - Architecture

BITS Pilani, Pilani Campus


Types of Data Warehouse

Enterprise Data Warehouse (EDW)

Operational Data Store (ODS)

Data Mart

BITS Pilani, Pilani Campus


Recap – DW Vs. ODS Vs. Data Mart
Criteria | Enterprise Data Warehouse (EDW) | Data Mart | Operational Data Store (ODS)
Scope | Organization-wide | Department or business function-specific | Real-time or near-real-time operational data
Purpose | Comprehensive analysis across the enterprise | Focused analysis for a specific business area | Real-time transactional processing
Data Size | Large volumes of data | Smaller subset of data from the EDW | Current snapshot of operational data
Integration Level | High integration of data from various sources | Integration focused on a specific business area | Real-time integration with operational systems
Granularity | Fine-grained data for enterprise-level insights | Can be fine-grained or summarized based on departmental needs | Real-time, detailed transactional data
Users | Broad range of users across the organization | Specific departmental or team users | Users requiring real-time operational data
Data Latency | Generally batch-oriented; may have longer update cycles | Can have shorter update cycles based on specific needs | Near-real-time or real-time updates
Example | An EDW for a multinational corporation integrating data from finance, sales, HR, etc. | A sales data mart for a retail company, focused on sales-related data | An ODS for a telecommunications company tracking real-time network performance

BITS Pilani, Pilani Campus


Data Warehouse Architecture

BITS Pilani, Pilani Campus


Data Warehouse Architecture

Data warehouses and their architectures vary depending upon the


elements of an organization's situation.
Three common architectures are:
• Data Warehouse Architecture: Single Tier
• Data Warehouse Architecture: Two Tier
• Data Warehouse Architecture: Three Tier

BITS Pilani, Pilani Campus


Data Warehouse Architecture – Single Tier

Operational System: An operational system refers to a system that is used to process the day-to-day transactions of an organization. Operational systems are the source systems that provide raw data to a data warehouse.

Flat Files: A flat file system is a system of files in which transactional data is stored, and every file in the system must have a different name.

The single-tier architecture is not a frequently practiced approach.

The main goal of having such an architecture is to remove redundancy by minimizing the amount of data stored.

This data warehouse architecture has no staging area or data marts and is not implemented in real-time systems.

This architecture type works best for processing an organization's operational data.

BITS Pilani, Pilani Campus


Data Warehouse Architecture – Two Tier

A two-tiered architecture format separates the business layer from the analytical
area, thereby giving the warehouse more control and efficiency over its processes.
The two-tiered architecture contains a source layer and data warehouse layer and
follows a 2-layer data flow process: 1. ETL 2. Data Warehouse

A two-tier architecture includes a


staging area for all data sources, before
the data warehouse layer.

By adding a staging area between the


sources and the storage repository, you
ensure all data loaded into the
warehouse is cleansed and in the
appropriate format.

BITS Pilani, Pilani Campus


Data Warehouse Architecture – Three Tier
This architecture has three layers: the source layer, the ETL (reconciled) layer, and the data warehouse layer, with a consumable layer of data on top.

The reconciled layer in this architecture sits between the source and data warehouse layer
and acts as a standard reference for an enterprise data model.

However, although this layer introduces better data quality for warehouses, the additional
storage space incurs extra costs.

BITS Pilani, Pilani Campus


Data Warehouse Architecture – Layer Comparison
Aspect | Single-Layer Architecture | Two-Layer Architecture | Three-Layer Architecture
Number of Layers | 1 | 2 | 3
Data Storage | Integrated into a single database | Separated from processing layer | Separated from processing and presentation layers
Data Processing | Combined with data storage in a single layer | Dedicated processing layer | Separate processing layer
Data Presentation | Combined with data storage in a single layer | Integrated with data processing layer | Separate presentation layer
Scalability | Limited scalability due to the single layer | Improved scalability compared to single-layer | Modular design supports better scalability
Performance | May face performance issues as data grows | Can provide better performance optimization | Balanced performance due to modular design
Flexibility | Less flexible to changes in data sources | Moderate flexibility with separation of layers | More flexible and adaptable to changes
Complexity | Simple and easy to understand | Moderate complexity with a separation of layers | More complex due to the presence of three layers
Cost-Effectiveness | More cost-effective for smaller-scale projects | Moderate cost, can be higher than single-layer | Potentially higher implementation costs
Security | Limited control over security measures | Can have better control over security measures | More granular control over security measures

BITS Pilani, Pilani Campus


Data Warehouse – ETL

BITS Pilani, Pilani Campus


ETL

• ETL, which stands for “extract, transform, load,” are the three processes that move data from
various sources to a unified repository—typically a Data Warehouse. It enables data analysis
to provide actionable business information, effectively preparing data for analysis and
business intelligence processes.
• As data engineers are experts at making data ready for consumption by working with multiple
systems and tools, data engineering encompasses ETL.
• Data engineering involves ingesting, transforming, delivering, and sharing data for analysis.
• These fundamental tasks are completed via data pipelines that automate the process in a
repeatable way.
• A data pipeline is a set of data-processing elements that move data from source to
destination, and often from one format (raw) to another (analytics-ready).

BITS Pilani, Pilani Campus


Purpose of ETL

• Purpose: ETL allows businesses to consolidate data from multiple


databases and other sources into a single repository with data that has
been properly formatted and qualified in preparation for analysis. This
unified data repository allows for simplified access for analysis and
additional processing. It also provides a single source of truth, ensuring
that all enterprise data is consistent and up-to-date.

BITS Pilani, Pilani Campus


Extraction (E)

▪ Extraction, in which raw data is pulled from a source or multiple sources. Data could come from transactional applications, such as customer relationship management (CRM) data from Salesforce or enterprise resource planning (ERP) data from SAP, or from Internet of Things (IoT) sensors that gather readings from a production line or factory floor operation, for example. To create a data warehouse, extraction typically involves combining data from these various sources into a single data set and then validating the data, with invalid data flagged or removed. Extracted data may be in several formats, such as relational databases, XML, JSON, and others.

• Advantage:
✓ Isolates the operational systems from the analytical systems, preventing impact
on performance.
✓ Enables the extraction of only relevant data, optimizing resources and reducing
unnecessary load on the data warehouse.

BITS Pilani, Pilani Campus


Transform (T)
▪ Transformation, in which data is updated to match the needs of an organization
and the requirements of its data storage solution. Transformation can involve
standardizing (converting all data types to the same format), cleansing (resolving
inconsistencies and inaccuracies), mapping (combining data elements from two or
more data models), augmenting (pulling in data from other sources), and others.
During this process, rules and functions are applied, and data is cleansed to prevent bad or non-matching data from being loaded into the destination repository. Rules that could be applied include loading only specific columns, deduplicating, and merging, among others.

• Advantage:
✓ Ensures data consistency and quality by applying business rules and validation checks.
✓ Allows for the integration of data from multiple sources into a unified and standardized
format.

BITS Pilani, Pilani Campus


Loading (L)

▪ Loading, in which data is delivered and secured for sharing, making


business-ready data available to other users and departments, both within
the organization and externally. This process may include overwriting the
destination’s existing data.

• Advantage:

✓ Populates the data warehouse with organized, high-quality data for reporting
and analysis.
✓ Supports incremental loading, allowing only new or changed data to be loaded, reducing processing time (an illustrative SQL sketch of such an incremental load follows below).
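A minimal sketch of an incremental load, assuming a staging table stg_sales and a warehouse table dw_sales that share a business key (all names are illustrative); MERGE is available in most warehouse platforms, though the exact syntax can vary.

-- Incremental load: update changed rows, insert new ones.
MERGE INTO dw_sales AS tgt
USING stg_sales AS src
    ON tgt.sale_id = src.sale_id
WHEN MATCHED THEN
    UPDATE SET sale_amount = src.sale_amount,
               quantity    = src.quantity,
               updated_at  = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN
    INSERT (sale_id, sale_amount, quantity, updated_at)
    VALUES (src.sale_id, src.sale_amount, src.quantity, CURRENT_TIMESTAMP);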

BITS Pilani, Pilani Campus


ETL Vs ELT
ELT (Extract, Load & Transform) is a variation in which:
• Data is extracted, first loaded and then transformed.
• This sequence allows businesses to preload raw data to a place where it
can be modified.
• ELT is more typical for consolidating data in a data warehouse, as cloud-
based data warehouse solutions are capable of scalable processing.
• Extract, load, transform is especially conducive to advanced analytics. For example, data scientists commonly load data into a data lake and then combine it with another data source or use it to train predictive models. Maintaining the data in a raw (or less processed) state allows data scientists to keep their options open. This approach is quicker as it leverages the power of modern data processing engines and cuts down on unnecessary data movement. A sketch of an ELT-style transform is shown below.
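A minimal ELT-style sketch, assuming raw records have already been loaded into a staging table raw_orders and are then transformed inside the warehouse engine (all names are illustrative; CREATE TABLE IF NOT EXISTS syntax varies by platform).

-- ELT: raw data is already loaded; the transform runs inside the warehouse.
CREATE TABLE IF NOT EXISTS clean_orders (
    order_id     INT,
    customer_id  INT,
    order_date   DATE,
    order_amount DECIMAL(12,2)
);

INSERT INTO clean_orders (order_id, customer_id, order_date, order_amount)
SELECT
    order_id,
    customer_id,
    CAST(order_ts AS DATE) AS order_date,   -- standardize: timestamp -> date
    ROUND(amount, 2)       AS order_amount  -- cleanse: normalize precision
FROM raw_orders
WHERE amount IS NOT NULL;                   -- filter out bad records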

BITS Pilani, Pilani Campus


Data Pipeline - A Complete Picture
(Diagram: source data — internal data, external data, and archived data — is EXTRACTed into a Staging Area, TRANSFORMed, and LOADed into the Data Warehouse and Data Marts; information delivery includes e-mail, cubes, data mining, and ad-hoc reporting; a Metadata Management & Control component coordinates the whole pipeline.)

Popular ETL tools: Fivetran, Matillion, AWS Glue, Azure Data Factory, Google Dataflow, Airbyte (open source), Apache NiFi (open source), Ab Initio, etc.

BITS Pilani, Pilani Campus


Data Pipeline - A Complete Picture

• Source Data Component: Source data coming into the data warehouses may be
grouped into four broad categories:
Example: Transactional databases, CRM systems, ERP systems, external data feeds.

Production Data: This type of data comes from the different operational systems of the enterprise.
Internal Data: In each organization, the client keeps their "private" spreadsheets, reports, customer profiles, and sometimes even department databases. This is the internal data, part of which could be useful in a data warehouse.
Archived Data: Operational systems are mainly intended to run the current business. In every operational system, we periodically take the old data and store it in archived files.
External Data: Most executives depend on information from external sources for a large percentage of the information they use. They use statistics relating to their industry produced by external agencies.

BITS Pilani, Pilani Campus


Data Pipeline - A Complete Picture

•Data Staging Component
❑ After we have extracted data from various operational systems and external sources, we have to prepare the files for storing in the data warehouse.
❑ The extracted data coming from several different sources needs to be changed, converted, and made ready in a format that is suitable to be saved for querying and analysis.
• Data Storage Component: Data storage for the data warehouse is a separate repository. The data repositories for the operational systems generally include only the current data, and they hold data structured in a highly normalized form for fast and efficient processing.
• Information Delivery Component: The information delivery element is used to enable the process of subscribing for data warehouse files and having them transferred to one or more destinations according to some customer-specified scheduling algorithm.
• Management and Control Component:
❑ The management and control elements coordinate the services and functions within the data warehouse.
❑ They control the data transformation and the data transfer into the data warehouse storage.
❑ They moderate the data delivery to the clients.
❑ They ensure that data is correctly saved in the repositories.
❑ They monitor the movement of data into the staging area and from there into the data warehouse storage itself.

BITS Pilani, Pilani Campus


Data Pipeline - A Complete Picture

The following are the functions of data warehouse tools and


utilities –
• Data Extraction - Involves gathering data from multiple
heterogeneous sources.
• Data Cleansing - Involves finding and correcting the errors in
data.
• Data Transformation - Involves converting the data from legacy
format to warehouse format.
• Data Loading - Involves sorting, summarizing, consolidating,
checking integrity, and building indices and partitions.
• Refreshing - Involves updating from data sources to warehouse.

BITS Pilani, Pilani Campus


Approach to Data Warehouse
• There can be two approaches in Data Warehouse:

➢ Top-Down Approach
➢ Bottom-Up Approach

BITS Pilani, Pilani Campus


Top-Down Approach

BITS Pilani, Pilani Campus


Top-Down Approach
Characteristic
• This approach was introduced by Inmon; the data warehouse acts as a central information repository for the complete enterprise, and data marts are created from it after the complete data warehouse has been set up.
• A global data model is designed to cater to the common information needs of the
organization.
• Data Marts: Data marts are then derived from the global data warehouse to address the
specific needs of individual departments.

Advantages
• Consistency and Integration: Promotes consistency and integration across the entire
organization.
• Centralized Control: Ensures centralized control over data definitions, standards, and
security.
• Holistic Perspective: Provides a holistic view of the organization's data, facilitating
enterprise-wide reporting and analysis.

BITS Pilani, Pilani Campus


Top-Down Approach - Challenges

• Longer Implementation Timelines: May have longer implementation timelines


due to the comprehensive nature of designing a centralized data warehouse.
• Resistance from Departments: May face resistance from individual departments
that prefer autonomy in managing their data.

BITS Pilani, Pilani Campus


Bottom-Up Approach

BITS Pilani, Pilani Campus


Bottom-Up Approach
Characteristic
• This approach was introduced by Kimball: data extracted from the sources is cleansed and transformed in the staging area, then sent to the data marts for each theme/subject, and from there it is loaded into the data warehouse.
• In a bottom-up approach, the focus is on creating smaller data marts to address the specific needs of individual departments or business units.
• Over time, these departmental data marts may be integrated to form a larger, enterprise-wide data warehouse.

Advantages
• Faster Implementation: Allows for faster implementation as it addresses the immediate
needs of individual departments.
• Departmental Autonomy: Departments have greater control over their data and can
implement solutions more quickly.

BITS Pilani, Pilani Campus


Bottom-Up Approach - Challenges
• Potential for Data Redundancy: May lead to data redundancy and
inconsistency across different data marts.
• Integration Challenges: Integration challenges may arise when
trying to combine data marts into an enterprise-wide data
warehouse.

BITS Pilani, Pilani Campus


MPP System – Massive Parallel Processing

BITS Pilani, Pilani Campus


What is MPP System?
➢ MPP stands for Massive Parallel Processing

In order to understand popular data warehouses like Snowflake, you first need to
understand their underlying architecture and the core principles upon which they
are built. Massively Parallel Processing (or MPP for short) is this underlying
architecture. Here, we’ll dive into what an MPP Database is, how it works, and
the strengths and weaknesses of Massively Parallel Processing
MPP Architecture
• An MPP database is a type of database or data warehouse where the data and processing power are split up among several different nodes (servers), with one leader node and one or many compute nodes.
• Using a library word-count analogy: the leader (you) would be called the leader node - you are telling all the other people what to do and sorting the final tally.
• The library employees, your helpers, would be called compute nodes - they deal with all the data, running the queries and counting up the words.
• MPP databases can scale horizontally by adding more compute resources (nodes), rather than having to worry about upgrading to more and more expensive individual servers (scaling vertically).
• Adding more nodes to the cluster allows the data and processing to be spread across more machines, which means a query will be completed sooner.
Database (DB) Vs. Data Warehouse (DW)

(Diagram: a Leader Node (aka Controller Node) coordinates a high-availability cluster of Compute Nodes with Storage, backed by a Storage Area Network (SAN).)


MPP Architecture
Key Takeaway

✓ MPP is a Data Warehouse system that concentrates on parallel


processing software and hardware
✓ Query processing is divided into numerous smaller parallel jobs done
concurrently across multiple servers
✓ Significantly reduces query and ingestion times
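To make the idea of splitting data and work across compute nodes concrete, here is a hedged Amazon Redshift-style sketch (Redshift being one popular MPP system); the table and keys are illustrative, and other MPP engines use different but analogous distribution clauses.

-- Redshift-style DDL: DISTKEY spreads rows across compute nodes by customer_id,
-- so joins and aggregations on customer_id can run in parallel on each node.
CREATE TABLE sales (
    sale_id     BIGINT,
    customer_id BIGINT,
    sale_date   DATE,
    amount      DECIMAL(12,2)
)
DISTSTYLE KEY
DISTKEY (customer_id)
SORTKEY (sale_date);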
Popular MPP systems
Introduction to Snowflake

➢ Snowflake was founded by a few former Oracle employees, who launched it in 2012.
➢ Snowflake’s Data Cloud is powered by an advanced data platform provided as Software-
as-a-Service (SaaS)
➢ Snowflake enables data storage, processing, and analytic solutions that are faster, easier
to use, and far more flexible than traditional offerings.
➢ The Snowflake data platform is NOT built on any existing database technology or “big
data” software platforms such as Hadoop – it has been developed from scratch
➢ Snowflake combines a completely new SQL query engine with an innovative architecture
natively designed for the cloud.
Snowflake Architecture - Introduction

Cloud Agnostic Layer


➢ Unlike many cloud data warehouses, Snowflake doesn’t run on its own cloud.
➢ It uses the storage and compute resources from popular cloud computing platforms –
Amazon Web Services (AWS), Microsoft Azure and Google Cloud Platform (GCP).
➢ Snowflake cannot run on a private cloud infrastructure, either on-premises or hosted.
➢ There is no hardware (virtual or physical) to
select, install, configure, or manage.
➢ There is virtually no software to install,
configure, or manage.
➢ Ongoing maintenance, management, upgrades,
and tuning are handled by Snowflake.
Snowflake Editions
Snowflake Architecture
• Snowflake’s architecture is a hybrid of
traditional shared-disk and shared-
nothing database architectures. Similar
to shared-disk architectures,
Snowflake uses a central data
repository for persisted data that is
accessible from all compute nodes in
the platform.
• But similar to shared-nothing
architectures, Snowflake processes
queries using MPP (massively parallel
processing) compute clusters where
each node in the cluster stores a
portion of the entire data set locally.
• This approach offers the data
management simplicity of a shared-
disk architecture, but with the
performance and scale-out benefits of
a shared-nothing architecture.
Snowflake Architecture - Storage
Database Storage
• When data is loaded into Snowflake, Snowflake reorganizes that data into its internal
optimized, compressed, columnar format. Snowflake stores this optimized data in
cloud storage.
• Snowflake manages all aspects of how this data is stored — the organization, file size,
structure, compression, metadata, statistics, and other aspects of data storage are
handled by Snowflake.
• The data objects stored by Snowflake are not directly visible nor accessible by
customers; they are only accessible through SQL query operations run using
Snowflake.
• Snowflake stores this optimized data in cloud storage
• Data stored in the storage layer is immutable
Snowflake Architecture – Virtual Warehouses
Query Processing
• Query execution is performed in the processing layer.
• Snowflake processes queries using “virtual warehouses”.
• Each virtual warehouse is an independent compute cluster that does not share
compute resources with other virtual warehouses. As a result, each virtual
warehouse has no impact on the performance of other virtual warehouses.
• Warehouse sizing is done in the form of T-shirt sizes.
• Multiple warehouses can access the same data in parallel.
• Scaling up/down or scaling out can be done in seconds.
Snowflake Architecture – Cloud Services
• This is “MAGIC” Layer of Snowflake, which acts like “Brain”
• Cloud services layer is completely managed by Snowflake
• It is a collection of services that coordinates various activities in Snowflake
• Collection of Services include:
❑Authentication
❑Infrastructure Management
❑Metadata Management
❑Query parsing and optimization
❑Access Control
• The cloud services layer also runs on compute instances provisioned
and managed by Snowflake.
Warehouse

➢ Warehouse (WH) are compute engine of Snowflake


➢ A WH is required for running any queries as well as DML operations
➢ A WH is defined by its size, like XS, S, M, L, XL, etc.
➢ A WH can be stopped or started at any time
➢ Charges for a WH apply only when it is running
➢ The recommendation is to enable the Auto-Resume & Auto-Suspend configuration on a WH to optimize cost (see the sample statement below)
➢ A WH can be resized on the fly (without any downtime)
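As a hedged illustration of these settings, a warehouse can be created and resized with Snowflake SQL along these lines (the warehouse name and parameter values are examples only):

-- Create a small warehouse that suspends itself when idle and resumes on demand.
CREATE WAREHOUSE IF NOT EXISTS demo_wh
  WAREHOUSE_SIZE      = 'XSMALL'
  AUTO_SUSPEND        = 60      -- seconds of inactivity before suspending
  AUTO_RESUME         = TRUE
  INITIALLY_SUSPENDED = TRUE;

-- Resize on the fly.
ALTER WAREHOUSE demo_wh SET WAREHOUSE_SIZE = 'MEDIUM';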
Warehouse Architecture – Multi-Cluster WH Example

(Diagram: Query 1, Query 2, and Query 3 are distributed across Cluster 1 and Cluster 2 of the same multi-cluster warehouse.)
Warehouse size & Credits

https://docs.snowflake.com/en/user-guide/warehouses-overview
Snowflake Regions
Region Considerations

➢ Users Proximity
➢ Network Latency
➢ Cost Implications – Cost is NOT same across all regions (Link: Snowflake
Pricing & Cost Structure | Snowflake Data Cloud)
➢ Cloud Service Provider Region from where data will be loaded into Snowflake
➢ Regional Laws

As of Dec’2024, there are no egress charges for Snowflake


Account Identifier

➢ A hostname for a Snowflake account starts with an account identifier and ends
with the Snowflake domain (snowflakecomputing.com).
➢ Snowflake supports two formats to use as the account identifier in your
hostname :
• Account Name (Recommended)
• Account Locator (Legacy)

Example:
Account Name:
https://<orgname>-<account_name>.snowflakecomputing.com
Account Locator: https://<accountlocator>.<region>.<cloud>.snowflakecomputing.com
Snowflake Demo
Knowledge Check

Q1. Which Data Warehouse component is responsible for supporting end-user


reporting and querying?
A. Staging Area
B. Data Mart
C. OLAP Server
D. Metadata Repository
Answer: C

Q2. What is the purpose of a Staging Area in Data Warehouse architecture?


A. Real-time data storage
B. Temporary storage for raw data before transformation
C. Final storage for analytical data
D. Query optimization
Answer: B
Knowledge Check

Q3. What is the primary advantage of using a Data Mart in a Data Warehouse architecture?
A. Centralized storage for all organizational data
B. Improved query performance for specific business units
C. Real-time data processing capabilities
D. Efficient support for complex ETL processes
Answer: B

Q4. Which of the following is one of the functions of Cloud Services in the Snowflake Cloud Data Warehouse architecture?
A. Process the query
B. Ingest Data into Warehouse
C. Authentication & Access
D. Persists Data
Answer: C
Knowledge Check

Q5. How many Warehouses can be created in Snowflake Cloud Data Warehouse
A. 4
B. 2
C. 1
D. No Limit
Answer: D

Q6. Which of the following is NOT a valid size of Snowflake Warehouse


A. Large
B. 2XL
C. XS
D. 2XS
Answer: D
Thank you

BITS Pilani, Pilani Campus


Dimensional Modelling

Instructor-in-Charge:
Sachin Arora, Guest faculty
BITS Pilani, Pilani Campus
Module 3

BITS Pilani, Pilani Campus


Module 3 – Introduction to Dimensional Modelling

3.1 - Dimensional Modelling


3.2 - ER Modelling vs Dimensional Modelling
3.3 - Star, Snowflake, Starflake schemas
3.4 - Data Warehouse design steps
3.5 - Slowly changing dimensions
3.6 - Types of facts, dimensions
3.7 – Retail, Inventory Case Study

BITS Pilani, Pilani Campus


Dimensional Modelling

BITS Pilani, Pilani Campus


Dimensional Modelling

• Dimensional Modeling (DM) is a data structure technique optimized for


data storage in a Data warehouse.
• The purpose of dimensional modeling is to optimize the database for
faster retrieval of data.
• A dimensional model in data warehouse is designed to read, summarize,
analyze numeric information like values, balances, counts, weights, etc.
in a data warehouse.
• In contrast, relational models are optimized for addition, updating, and deletion of data in a real-time Online Transaction Processing system.

BITS Pilani, Pilani Campus


Elements of Dimensional Modelling

Fact:
➢ Facts are the measurements/metrics or facts from your business process.
➢ Facts are usually numeric and represent key performance indicators (KPIs) or
business metrics.
➢ Example: In a sales data warehouse, a fact might be the total sales amount,
the quantity of products sold, or the number of orders placed.

Key Characteristics:
➢ Numeric and measurable.
➢ Represent the business metrics or KPIs.
➢ Often involve aggregations and calculations.

BITS Pilani, Pilani Campus


Elements of Dimensional Modelling

Fact:
➢ Measurable Metrics: In a retail data warehouse, a fact could be the "Total Sales
Amount" for a specific day, week, or month.
➢ Aggregated Values: A fact might represent the "Total Quantity Sold" of a
particular product across multiple stores.
➢ Count-based Metrics: In a call center data warehouse, a fact could be the
"Number of Calls Received" during a specific time period.
➢ Duration-based Metrics: In a service-level agreement (SLA) monitoring system,
a fact might be the "Total Downtime" of a system in hours.
➢ Financial Metrics: For a financial institution, a fact could be the "Total Assets
Under Management" for a specific portfolio.

BITS Pilani, Pilani Campus


Elements of Dimensional Modelling

Dimension:
• A dimension provides context and additional descriptive information about the data
in a data warehouse.
• Dimensions are the descriptive data elements that are used to categorize or classify
the data.
• Dimensions are typically non-numeric attributes that help to categorize, filter, and
analyze the facts.
Example: In the same sales data warehouse, dimensions might include information
about products (e.g., product category, brand), customers (e.g., customer name,
region), and time (e.g., date, month, year).
Customer Dimension: Customer Name, Address, and Customer Segment.
Geographical Dimension: City, State, and Country.
Employee Dimension: Employee Name, Department, and Job Title.

BITS Pilani, Pilani Campus


Elements of Dimensional Modelling
Fact Table
• Stores the performance measurements resulting from a business process.
• In a dimensional data model, the fact table is the central table that contains the
measures or metrics of interest, surrounded by the dimension tables that describe
the attributes of the measures.
Dimension Table
• Dimension tables contain the details about the facts.
• The dimension tables are related to the fact table through foreign key relationships.
• Dimension tables are simply denormalized tables.
• A dimension can have one or more relationships.

BITS Pilani, Pilani Campus


Dimensional Modelling - Example

BITS Pilani, Pilani Campus


Dimension Tables
• These are the tables that are joined to fact tables.
• It describes the “who, what, where, when, how, and why” associated with the
business event.
• It contains the descriptive attributes used for grouping and filtering the facts.
• Dimensions describe the measurements of the fact table.
• For example, customer id appears as a key in the fact table, but the dimension describes its attributes further, such as the name of the customer, the address of the customer, gender, etc.

BITS Pilani, Pilani Campus


Dimension Tables - Example

BITS Pilani, Pilani Campus


Dimension Tables - Example

BITS Pilani, Pilani Campus


Dimension Tables Vs Fact Tables

Aspect | Fact Table | Dimension Table
Content | Contains quantitative, numeric data (facts). Examples include sales amounts, quantities, and metrics. | Contains descriptive, non-numeric attributes that provide context to the facts. Examples include customer names, product details, and time attributes.
Granularity | Typically has a finer granularity, representing the lowest level of detail in the data warehouse. | Represents aggregated and summarized data at various levels.
Primary Key | Usually has a composite primary key composed of foreign keys from dimension tables. | Has a single primary key, often a surrogate key, which is unique to each dimension record.
Foreign Keys | Contains foreign keys linking to various dimension tables, establishing relationships. | May have foreign keys linking to higher-level dimensions or to other related dimensions.

BITS Pilani, Pilani Campus


Dimension Tables Vs Fact Tables
Aspect | Fact Table | Dimension Table
Measures | Contains measures or metrics that are subject to analysis, aggregation, and reporting. | Does not contain measures; instead, it contains descriptive attributes for categorization and filtering.
Aggregation | Data is pre-aggregated to support efficient query performance for analytical reporting. | Typically does not involve pre-aggregated data; instead, aggregations are performed during querying.
Changes Over Time | May incorporate historical data through techniques like slowly changing dimensions (SCD). | Historical changes are less common; dimensions often reflect the current state of attributes.
Size and Volume | Tends to have a larger volume of data due to detailed and fine-grained information. | Generally has a smaller volume compared to fact tables due to its descriptive nature.
Indexes | Indexing strategies are critical for performance optimization, especially for large fact tables. | Indexing is still important but may not be as critical as in fact tables for query performance.
Example Use Cases | Sales fact table, containing daily sales transactions with measures like sales amount and quantity sold. | Customer dimension table, containing attributes such as customer name, address, and demographic details.
Star Schema | Central component in a star schema or snowflake schema, connecting to multiple dimensions. | Forms the outer layer of a star schema, providing context to the measures in fact tables.

BITS Pilani, Pilani Campus


Types of Dimensions

Conformed Dimension:
Definition: A conformed dimension is a dimension that is shared and consistent across multiple data
marts or data warehouses within an organization.
Use Case: For example, a "Time" dimension with consistent date hierarchies and attributes is conformed
across various business units or departments.

Outrigger Dimension:
Definition: An outrigger dimension is an additional set of attributes associated with an existing
dimension table. It extends the information available in the primary dimension.
Use Case: Adding supplementary attributes like additional customer details (e.g., social media handles)
to an existing "Customer" dimension.

Shrunken Dimension:
Definition: A shrunken dimension is a subset of a larger dimension that is specific to a particular
department or business process. It's derived from a larger, enterprise-wide dimension.
Use Case: A "Product" dimension that is a subset focused on a specific product category or line within a
larger enterprise-wide "Product" dimension

BITS Pilani, Pilani Campus


Types of Dimension

Role-Playing Dimension:
Definition: A role-playing dimension is a single dimension table that is used in multiple ways (roles)
within the same fact table. Each role represents a different perspective or context.
Use Case: Using a "Date" dimension for order date, ship date, and delivery date within the same fact
table.

Dimension to Dimension Table:


Definition: A dimension-to-dimension table, also known as a bridge table, is used to represent a
many-to-many relationship between two dimensions in a fact table.
Use Case: Representing the relationship between "Employees" and "Projects" where an employee
can work on multiple projects, and a project involves multiple employees.

Degenerate Dimension:
Definition: A degenerate dimension is a dimension that is derived from a fact table and does not
have its own dedicated dimension table. It's a grouping of attributes without a separate entity.
Use Case: Using an invoice number as a degenerate dimension in a sales fact table

BITS Pilani, Pilani Campus


Types of Dimension

Swappable Dimension:
Definition: A swappable dimension allows for the substitution of one dimension for another
without impacting the structure or integrity of the data warehouse.
Use Case: Substituting a "Product" dimension with a "Promotional Event" dimension for a
specific analysis without modifying the schema.

Step Dimension:
Definition: A step dimension represents a sequence or steps in a process, often used to track
the progress of an entity through different stages.
Use Case: Modeling the stages of a customer's journey, where each stage represents a step in
the dimension.

BITS Pilani, Pilani Campus


Few Examples

Conformed Dimension

A conformed dimension is any dimension that is shared across multiple fact tables or subject areas. The
diagram below shows a simple conceptual model of conformed dimension.

BITS Pilani, Pilani Campus


Few Examples

Role-Playing Dimension

A role-playing dimension is a dimension that can filter related facts differently. For example, the
date dimension table has two relationships to the online transaction facts. The same dimension table
can be used to filter the facts by ship date or order date.

BITS Pilani, Pilani Campus


Steps to Create Dimensional Data Modelling

Select the Business Process → Decide the Grain → Identify Dimensions → Identify Facts

BITS Pilani, Pilani Campus


Step 1: Identify Business Process

Step 1: Select the Business Process

The first step involves selecting the business process, and it should be an action resulting in
output.

Business Process #1:The e-Commerce industry is widely known for selling and buying goods
over the internet, so our first business process will be the products bought by the
customers.

Business Process #2: Delivery status is also one of the most important business processes in this industry. It tells us where the product currently is as it is dispatched from the warehouse to the customer's given address.

Business Process #3: Maintaining the inventory in order to ensure that items don’t run out
of stock, how sales are going on etc.

BITS Pilani, Pilani Campus


Step 2: Decide the Grain

• Identifying the grain, or the level of detail, is a crucial step in dimensional modeling.
• The grain defines what each row in a fact table represents
• A grain is a business process at a specified level.
• All the rows in a fact table should result from the same grain.
• Each fact table is the result of a different grain selected in a business process.
• The grain should be as granular (at the lowest level) as possible.
• To identify the grain, you need a clear understanding of the business processes and the
specific information that stakeholders want to analyze.

Grains for the above business processes are


Grain 1: We could define the grain as the products purchased by the customer, i.e., each row of the fact table would represent all the products checked out by the customer from the cart; if a customer ordered 100 products, that would be represented as a single row. Instead, the grain will be an individual product ordered by a customer, i.e., one product per row. This will make the data simple and easy to query. Similarly, we will select the most granular grains for the remaining processes.
Grain 2: Here also, the grain will be the status of an individual product shipped from the warehouse to the delivery location.
Grain 3: Here, each row will represent the daily inventory for each product in each store; it will tell the stock of that product left in the inventory and how many products have already been sold.

BITS Pilani, Pilani Campus


Step 2: Decide the Grain - Examples

Sales Data:
Business Process: Recording sales transactions.
Grain: Each row in the fact table represents a unique sales transaction, capturing details such as the product sold,
quantity, price, and customer.
Customer Interaction:
Business Process: Capturing customer interactions in a call center.
Grain: Each row represents a specific customer interaction, including details such as the call duration, customer,
agent, and date.
Inventory Management:
Business Process: Managing inventory levels.
Grain: Each row represents the inventory level for a specific product at a specific location on a specific date.
Web Analytics:
Business Process: Tracking user interactions on a website.
Grain: Each row represents a specific user session, capturing details such as page views, time spent, and actions
taken.
Employee Time Tracking:
Business Process: Recording employee work hours.
Grain: Each row represents the work hours logged by an employee for a specific date.
Healthcare Data:
Business Process: Recording patient encounters in a healthcare system.
Grain: Each row represents a patient encounter, capturing details such as the patient, healthcare provider, date, and
services provided.

BITS Pilani, Pilani Campus


Step 3: Identify the Dimensions for the Dimensional Table
• It describes the “who, what, where, when, how, and why” associated with the business
event. It contains the descriptive attributes used for grouping and filtering the facts.

Consider the following dimensions:

BITS Pilani, Pilani Campus


Step 4: Identify the Facts for the Dimensional Table
• The term fact represents a business measure; therefore, a fact table in dimensional
modeling stores the performance measurements resulting from a business process.
• These performance measurements measure the business, i.e., these are the metrics
through which we can infer whether our business is in profit or loss.
• Different business measurements can be unit price, number of goods sold, etc.

Grain: Individual product of the order per row.

BITS Pilani, Pilani Campus


Schema

• Schema means the logical description of the entire database.


• It gives us a brief idea about the link between different database tables through keys and
values.
• A data warehouse also has a schema like that of a database.
• In database modeling, we use the relational model schema.
• Whereas in data warehouse modeling, we use the Star, Snowflake, and Galaxy schemas.

BITS Pilani, Pilani Campus


Star Schema
• A star schema is a type of data warehouse schema where a central fact table is connected to
one or more dimension tables through foreign key relationships.
• This structure resembles a star when visualized, with the fact table at the center and
dimension tables surrounding it like points on a star.
• The star schema is a popular design for data warehouses due to its simplicity and efficiency in
querying and reporting.
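
A minimal SQL sketch of a star schema (all table and column names are assumptions for illustration) shows the central fact table referencing each dimension through a foreign key:

-- Dimension tables (the "points" of the star)
CREATE TABLE dim_product  (product_key  INT PRIMARY KEY, product_name VARCHAR(100), category VARCHAR(50));
CREATE TABLE dim_customer (customer_key INT PRIMARY KEY, customer_name VARCHAR(100), country VARCHAR(50));
CREATE TABLE dim_date     (date_key     INT PRIMARY KEY, full_date DATE, month_name VARCHAR(20), calendar_year INT);

-- Central fact table, one row per product per order (the grain chosen earlier)
CREATE TABLE fact_sales (
    order_id     INT,
    product_key  INT REFERENCES dim_product(product_key),
    customer_key INT REFERENCES dim_customer(customer_key),
    date_key     INT REFERENCES dim_date(date_key),
    quantity     INT,
    sales_amount DECIMAL(12,2)
);

-- Example query: total sales by country and month
SELECT c.country, d.month_name, SUM(f.sales_amount) AS total_sales
FROM fact_sales f
JOIN dim_customer c ON f.customer_key = c.customer_key
JOIN dim_date d     ON f.date_key = d.date_key
GROUP BY c.country, d.month_name;

A typical report joins the fact table only to the dimensions it needs, which is why query logic in a star schema stays simple.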

BITS Pilani, Pilani Campus


Star Schema – Advantages & Disadvantages

ADVANTAGES:
1. Most Suitable for Query Processing: View-only reporting applications show enhanced
performance.
2. Simple Queries: Navigation through the database is optimized because the star-join
schema logic is much simpler.
3. Simplest and Easiest to design.

DISADVANTAGES:
1. They don’t support many-to-many relationships# between business entities.
2. Data redundancy: because each dimension has only one denormalized dimension table,
values may be repeated. For example, a star schema would repeat the values in the
field customer_address_country for each order from the same country.

# A many-to-many relationship occurs when multiple records in one table are associated
with multiple records in another table.

BITS Pilani, Pilani Campus


How to create Schemas using SQL Server or PgAdmin Tool

Click Here for SQL Server

Click Here for PgAdmin

BITS Pilani, Pilani Campus


Snowflake Schema
• A snowflake schema is an extension of the star schema.
• The snowflake schema consists of one fact table which is linked to many dimension
tables, which can be linked to other dimension tables through a many-to-one
relationship.
• It is called snowflake schema because the diagram of snowflake schema resembles a
snowflake.
• Snowflaking is a method of normalizing the dimension tables in a STAR schema.
• When we normalize all the dimension tables entirely, the resultant structure resembles
a snowflake with the fact table in the middle.
• Some dimension tables in the Snowflake schema are normalized.
• The normalization splits up the data into additional tables.
• Unlike the star schema, the dimension tables in a snowflake schema are normalized. For
example, the item dimension table in the star schema is normalized and split into two
dimension tables, namely the item and supplier tables, as sketched below.
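
A rough SQL sketch of that item/supplier split (column names are assumptions for illustration):

-- Star schema version would repeat supplier attributes on every item row, e.g.:
-- dim_item(item_key, item_name, brand, supplier_name, supplier_city)

-- Snowflaked version: the item dimension is normalized into item and supplier tables
CREATE TABLE dim_supplier (
    supplier_key  INT PRIMARY KEY,
    supplier_name VARCHAR(100),
    supplier_city VARCHAR(50)
);

CREATE TABLE dim_item (
    item_key     INT PRIMARY KEY,
    item_name    VARCHAR(100),
    brand        VARCHAR(50),
    supplier_key INT REFERENCES dim_supplier(supplier_key)  -- many-to-one link to supplier
);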

Click here to learn more about normalization

BITS Pilani, Pilani Campus


Snowflake Schema - Example

BITS Pilani, Pilani Campus


Snowflake Schema

• Central Fact Table: Similar to the star schema, there is a central fact table that contains
the primary metrics and measures.
• Normalized Dimension Tables: Unlike the star schema, the dimension tables in a
snowflake schema are normalized, meaning that they are divided into multiple related
tables.
• Normalization involves breaking down large tables into smaller ones to reduce
redundancy and improve data integrity.
• Hierarchical Structure: The normalized dimension tables are organized into a hierarchy,
forming a structure that resembles a snowflake. Each level of the hierarchy represents
a subset of the data.

The primary advantage of the snowflake schema is its ability to reduce data redundancy
and improve data integrity through normalization

BITS Pilani, Pilani Campus


Snowflake Schema – An Example

BITS Pilani, Pilani Campus


Snowflake Schema – Downsides
There are trade-offs associated with this approach, which are:

• Query Performance: While normalization can improve data integrity, it can also result in
more complex queries, as analysts may need to join multiple tables to retrieve the
required information. This can impact query performance.

• Maintenance Complexity: The snowflake schema can be more complex to design and
maintain compared to the star schema. Changes to the schema structure may require
modifications to multiple tables.

The choice between a star schema and a snowflake schema depends on the specific requirements of the
data warehouse and the trade-offs that the organization is willing to make

BITS Pilani, Pilani Campus


Star Schema Vs Snowflake Schema

• STAR schema for sales in a manufacturing company.


• The sales fact table includes quantity, price, and other relevant metrics. SALESREP,
CUSTOMER, PRODUCT, and TIME are the dimension tables.

BITS Pilani, Pilani Campus


Star Schema Vs Snowflake Schema – An Example

→ The STAR schema for sales, as shown in the example, contains only five tables, whereas the normalized version now
extends to eleven tables.
→ Notice that in the snowflake schema, the attributes with low cardinality in each original dimension table are
removed to form separate tables.
→ These new tables are connected back to the original dimension table through artificial keys.

Low-cardinality refers to columns with few unique values

BITS Pilani, Pilani Campus


Star Schema Vs Snowflake Schema – Summary

Star Schema
• The fact table is at the center and is connected directly to the dimension tables.
• The dimension tables are completely denormalized.
• SQL query performance is good because fewer joins are involved.
• Data redundancy is higher, so it occupies more disk space.

Snowflake Schema
• An extension of the star schema in which the dimension tables are connected to one or more further dimension tables.
• The dimension tables are normalized, splitting the data into additional tables.
• SQL query performance is slightly lower than the star schema because more joins are involved.
• Data redundancy is lower, so it occupies less disk space compared to the star schema.

BITS Pilani, Pilani Campus


Star Schema – Demo

BITS Pilani, Pilani Campus


Fact Constellation Schema or Galaxy Schema
• A fact constellation has multiple fact tables. It is also known as galaxy schema.
• The following diagram shows two fact tables, namely sales and shipping.
• The sales fact table is the same as that in the star schema.
• The shipping fact table has five dimension keys, namely item_key, time_key, shipper_key,
from_location, and to_location.
• The shipping fact table also contains two measures, namely dollars sold and units sold.
• It is also possible to share dimension tables between fact tables. For example, time, item,
and location dimension tables are shared between the sales and shipping fact table.

BITS Pilani, Pilani Campus


Fact Constellation Schema
• The primary disadvantage of the fact constellation schema is that it is a more
challenging design because many variants for specific kinds of aggregation must be
considered and selected.

BITS Pilani, Pilani Campus


Starflake Schema Or Star Cluster Schema
• A Snowflake schema with many dimension tables may need more complex joins while querying. A star
schema with fewer dimension tables may have more redundancy. Hence, a star cluster schema came
into the picture by combining the features of these two schemas.
• The star schema is the base for designing a star cluster schema; a few essential dimension tables from the
star schema are snowflaked, and this, in turn, forms a more stable schema structure.

BITS Pilani, Pilani Campus


Slowly Changing Dimensions (SCDs)
• Dimensions in a data warehouse are the descriptive information about business entities,
and they often change over time.
• Slowly Changing Dimensions (SCDs) are a concept in data warehousing that deals with
how to handle changes in dimension data over time.
• SCDs provide a way to manage these changes so that historical data remains accurate and
meaningful.

BITS Pilani, Pilani Campus


Slowly Changing Dimensions (SCDs) - Types
Type 1 - Overwrite:
• In a Type 1 SCD, when a change occurs, the existing dimension record is simply updated with the new
information.
• There is no tracking of historical changes, so the historical perspective is lost.
• This method is suitable when historical changes are not important for analysis, and the most recent
information is sufficient.
• A good example of this is customer addresses. You don’t need to keep track of how a customer’s
address has changed over time; you just need to know you are sending an order to the right place.
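
A minimal Type 1 sketch in SQL (the dim_customer table and its address/city columns are hypothetical): the change simply overwrites the existing value, so no history survives.

-- Customer 42 moves; the old address is overwritten in place
UPDATE dim_customer
SET    address = '221B Baker Street',
       city    = 'London'
WHERE  customer_key = 42;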

Type 2 – Add New Row:


• In a Type 2 SCD, when a change occurs, a new row is added to the dimension table with the updated
information.
• The existing row remains unchanged, and it is marked as inactive or outdated.
• This approach preserves the historical changes and allows for tracking the history of the dimension
over time.
• This method is suitable when historical changes need to be maintained for reporting and analysis.
• You can also handle type 2 dimensions by adding a timestamp column or two to show when a new
record was created or made active and when it was made ineffective. Instead of checking for whether a
record is active or not, you can find the most recent timestamp and assume that is the active data row.
You can then piece together the timestamps to get a full picture of how a row has changed over time.
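
A rough Type 2 sketch (columns such as effective_from, effective_to, and is_current are common conventions assumed here, as are the table and key names): expire the current row and insert a new row with a new surrogate key.

-- Expire the currently active row for the customer (natural key 'C-42')
UPDATE dim_customer
SET    is_current   = FALSE,
       effective_to = CURRENT_DATE
WHERE  customer_id = 'C-42'
  AND  is_current  = TRUE;

-- Insert a new row with a new surrogate key carrying the changed attributes
INSERT INTO dim_customer (customer_key, customer_id, address, effective_from, effective_to, is_current)
VALUES (1043, 'C-42', '221B Baker Street', CURRENT_DATE, NULL, TRUE);

Fact rows loaded after the change reference the new surrogate key, while older fact rows keep pointing at the expired row, preserving history.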

BITS Pilani, Pilani Campus


Slowly Changing Dimensions (SCDs) - Types

Type 2 – Add New Row - Example

BITS Pilani, Pilani Campus


Slowly Changing Dimensions (SCDs) - Types

Type 3 – Add Columns


• Type 3 dimensions track changes in a row by adding a new column.
• Instead of adding a new row with a new primary key like with type 2 dimensions, the primary key
remains the same and an additional column is appended.
• This is good if you need your primary key to remain unique and only have one record for each natural
key.
• However, you can really only track one change in a record rather than multiple changes over time.
• Think of this as a dimension you’d want to use for one-time changes.
• For example, let’s say your warehouse location is changing. Because you don’t expect the address of
your warehouse to change more than once, you add a `current_address` column with the address of
your new warehouse. You then change the original address column name to be `previous_address`
and store your old address information.
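
A minimal Type 3 sketch for the warehouse-address example above (the dim_warehouse table, column names, and address value are assumptions): rename the original column to hold the previous value and add a column for the current one.

-- One-time structural change: keep exactly one prior value per row
ALTER TABLE dim_warehouse RENAME COLUMN address TO previous_address;
ALTER TABLE dim_warehouse ADD COLUMN current_address VARCHAR(200);

-- Record the new address while the old one stays in previous_address
UPDATE dim_warehouse
SET    current_address = '7 Industrial Park Rd'
WHERE  warehouse_key = 1;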

BITS Pilani, Pilani Campus


Slowly Changing Dimensions (SCDs) - Types

Type 4 – Add New Table


• Type 4 dimensions exist as records in two different tables- a current record table and a
historical record table.
• All of the records that are active in a given moment will be in one table and then all of the
records considered historical will exist in a separate history table.
• This is a great way of keeping track of records that have many changes over time.
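
A rough Type 4 sketch (table and column names assumed): every change is first appended to a history table, and the current table is then overwritten so it keeps only active rows.

-- Archive the row that is about to change
INSERT INTO dim_customer_history (customer_key, address, city, valid_until)
SELECT customer_key, address, city, CURRENT_DATE
FROM   dim_customer
WHERE  customer_key = 42;

-- Then overwrite the current table, which stays one row per customer
UPDATE dim_customer
SET    address = '221B Baker Street',
       city    = 'London'
WHERE  customer_key = 42;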

BITS Pilani, Pilani Campus


Thank you

BITS Pilani, Pilani Campus


Data Warehouse - ETL

Instructor-in-Charge:
Sachin Arora, Guest faculty
BITS Pilani BITS Pilani
Pilani Campus
Disclaimer

Content in this deck has been curated from multiple online sources, including vendors’
documentation

BITS Pilani, Pilani Campus


Module 4

BITS Pilani, Pilani Campus


Module 4 – ETL

4.1. ETL Overview


4.2 Data Extraction
4.3. Data Transformation
4.4. Data Loading
4.5. Data Quality
4.6 Snowflake Cloud Data Warehouse – Data Loading

BITS Pilani, Pilani Campus


ETL

BITS Pilani, Pilani Campus


ETL (Extract, Transform, Load)

ETL (Extract, Transform, Load) is a process in data warehousing and data integration that
involves the following key steps:

Step 1: Extract

The extraction process involves copying or exporting raw data from multiple locations called
source locations and storing them in a staging location for further processing.

Source locations can consist of any type of data, including SQL or NoSQL servers, flat files,
emails, logs, web pages, CRM and ERP systems, spreadsheets, etc.

Common data extraction methods are:


▪ Partial extraction with update notification
▪ Partial extraction without update notification
▪ Full extraction

BITS Pilani, Pilani Campus


ETL (Extract, Transform, Load)
Step 2: Transform
In the transformation stage of the ETL process, data in the staging area is
transformed through the data processing phase to make it suitable for analytics.
Raw data is converted into a consolidated, meaningful data set.

Several tasks are performed on the data like:


▪ Cleaning and Standardization
▪ Verification and Validation
▪ Filtering and Sorting
▪ De-duplication
▪ Data audits
▪ Calculations, Translations
▪ Formatting
▪ Data encryption, protection

BITS Pilani, Pilani Campus


ETL (Extract, Transform, Load)
Step 3: Load
In this final step of the ETL process, the transformed data is loaded onto its
target destination, which can be a simple database or even a data warehouse.
The size and complexity of data, along with the specific organizational needs,
determine the nature of the destination.

The load process can be:


Full loading – occurs only at the time of first data loading or for disaster recovery
Incremental loading – loading of updated data
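
As an illustration of incremental loading, a MERGE statement is a common way to upsert only the changed rows from a staging table into the target; the table and column names below are hypothetical:

-- Upsert new and changed customer rows from staging into the warehouse table
MERGE INTO dw_customers AS tgt
USING stg_customers AS src
  ON tgt.customer_id = src.customer_id
WHEN MATCHED THEN
  UPDATE SET name = src.name, email = src.email, updated_at = src.updated_at
WHEN NOT MATCHED THEN
  INSERT (customer_id, name, email, updated_at)
  VALUES (src.customer_id, src.name, src.email, src.updated_at);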

BITS Pilani, Pilani Campus


ETL (Extract, Transform, Load)

BITS Pilani, Pilani Campus


ETL (Extract, Transform, Load)
ETL Tools - Some prominent ETL software tools are:

▪ Talend
▪ Oracle Data Integrator
▪ Amazon RedShift
▪ AWS Glue
▪ Matillion
▪ Azure Data Factory
▪ Fivetran

BITS Pilani, Pilani Campus


Process for ETL Testing (aka Data Quality Check)

ETL testing is unique since:


• ETL processes are background processes and don’t have user screens.
• ETL testing involves a large amount of data.
• ETL processes are like functions where testing requires execution of the ETL process
and then the comparison of input and output data.
• The defects in ETL processes cannot be detected by simply reviewing the ETL
code.

BITS Pilani, Pilani Campus


How to do ETL Testing?
ETL processes are evaluated indirectly through a black-box testing approach, wherein the ETL
process is first executed to create the output data, and the quality of the ETL process is then
determined by verifying that output data.

ETL testing process is summarized in the following three steps

1. First, the ETL code is executed to generate the output data.


2. Then the output data is compared with the predetermined expected data.
3. Based on the comparison results, the quality of the ETL process is determined.
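
One common way to perform the comparison in step 2 is a set-difference query between the expected data and the loaded target data; a rough sketch with hypothetical table names:

-- Rows present in the expected data but missing or different in the target
SELECT customer_id, name, email FROM expected_customers
EXCEPT
SELECT customer_id, name, email FROM dw_customers;

-- Rows present in the target but not in the expected data (unexpected/extra rows)
SELECT customer_id, name, email FROM dw_customers
EXCEPT
SELECT customer_id, name, email FROM expected_customers;

-- A simple record-count reconciliation is also typical
SELECT (SELECT COUNT(*) FROM expected_customers) AS expected_rows,
       (SELECT COUNT(*) FROM dw_customers)       AS loaded_rows;

An empty result from both EXCEPT queries, together with matching counts, suggests the ETL output matches the expected data.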

BITS Pilani, Pilani Campus


DQ Industry Tools
Here are few DQ tools available in the industry today:

• Informatica Data Quality


• Talend
• Ataccama
• SAP Information Steward
• Collibra

BITS Pilani, Pilani Campus


Snowflake - Data Loading

BITS Pilani, Pilani Campus


Loading Data into Snowflake
• Snowflake refers to the location of data files in cloud storage as a stage. The COPY INTO
<table> command used for both bulk and continuous data loads (i.e. Snowpipe) supports cloud
storage accounts managed by your business entity (i.e. external stages) as well as cloud storage
contained in your Snowflake account (i.e. internal stages).
• External stages: Loading data from any of the following cloud storage services is supported
regardless of the cloud platform that hosts your Snowflake account: Amazon S3, Google Cloud
Storage, Microsoft Azure Data Lake Storage (ADLS)

• Internal stages Snowflake maintains the following stage types in your account:

➢ User: A user stage is allocated to each user for storing files.


➢ Table: A table stage is available for each table created in Snowflake. This stage type is designed
to store files that are staged and managed by one or more users but only loaded into a single
table.
➢ Named: This stage type can store files that are staged and managed by one or more users and
loaded into one or more tables.

Upload files to any of the internal stage types from your local file system using the PUT command.
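
For example, each internal stage type is referenced with a different prefix in SnowSQL (the file path, table name, and stage name below are placeholders):

-- User stage (one per user)
PUT file:///tmp/emp.csv @~;
LIST @~;

-- Table stage (one per table; here the target table is assumed to be named EMP)
PUT file:///tmp/emp.csv @%emp;
LIST @%emp;

-- Named stage (assumed created earlier with CREATE STAGE my_named_stage)
PUT file:///tmp/emp.csv @my_named_stage;
LIST @my_named_stage;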

BITS Pilani, Pilani Campus


Snowflake - Data Loading Local File System

BITS Pilani, Pilani Campus


Snowflake - Data Loading Cloud Storage (Azure Example)

BITS Pilani, Pilani Campus


Loading Data into Snowflake Using Internal Stage

1. Create Snowflake Internal Stage “CREATE STAGE test_db.schema1.my_int_stage;”


2. Create a table into which data will be loaded from sample file.
3. Install SnowSQL client
4. Connect to Snowflake using SnowSQL Client “snowsql -a <account name> -u <username>”
5. Load a sample file from your laptop to Internal stage via PUT command: put file://c:/samplefile.csv
@my_int_stage; (this syntax is for Windows client only; For Linux or Mac, refer to documentation:
https://docs.snowflake.com/en/sql-reference/sql/put)
6. Load data into the table using the command as below:

copy into <DB_Name>.<Schema_Name>.<Table_Name>
from @<DB_Name>.<Schema_Name>.<Stage_Name>/<FileName>
FILE_FORMAT = ( TYPE = <FORMAT OF FILE>, <additional parameters as needed>);

Example:
copy into test_db.schema1.st_emp
from @test_db.schema1.my_int_stage/samplefile.csv
FILE_FORMAT = ( TYPE = CSV, skip_header=1);
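
For step 2 above, a minimal target table matching the sample file might look like this (the column names and types are assumptions about samplefile.csv):

CREATE TABLE test_db.schema1.st_emp (
    emp_id     INT,
    emp_name   VARCHAR(100),
    department VARCHAR(50),
    salary     NUMBER(10,2)
);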

BITS Pilani, Pilani Campus


File Format Support by Snowflake

File formats supported: CSV | JSON | AVRO | ORC | PARQUET | XML

Data Loading Options for each file format are available here: CREATE FILE FORMAT | Snowflake Documentation

Example:

CREATE OR REPLACE FILE FORMAT my_csv_format


TYPE = CSV
FIELD_DELIMITER = '|'
SKIP_HEADER = 1
NULL_IF = ('NULL', 'null')
EMPTY_FIELD_AS_NULL = true
COMPRESSION = gzip;

BITS Pilani, Pilani Campus


Loading Data into Snowflake Using External Stage

1. Create Snowflake External Stage “CREATE STAGE my_azure_stage STORAGE_INTEGRATION = azure_int URL =
'azure://myaccount.blob.core.windows.net/mycontainer/load/files/' FILE_FORMAT = my_csv_format;”
2. Create a table into which data will be loaded from sample file.
3. Load data into the table using the command as below:

copy into <DB_Name>.<Schema_Name>.<Table_Name>
from @<DB_Name>.<Schema_Name>.<Stage_Name>/<FileName>
FILE_FORMAT = ( TYPE = <FORMAT OF FILE>, <additional parameters as needed>);

Example:
copy into test_db.schema1.st_emp
from @test_db.schema1.my_azure_stage/samplefile.csv
FILE_FORMAT = ( TYPE = CSV, skip_header=1);

BITS Pilani, Pilani Campus


[Diagram: data loading paths into Snowflake — on AWS, Emp.csv files land in S3 and are read through an external stage, or are uploaded directly into a Snowflake internal stage; on Azure, data from Oracle and SQL sources is moved by ADF into ADLS and then loaded into Snowflake.]
Thank you

BITS Pilani, Pilani Campus
