DW Mod 4
VISVESVARAYA TECHNOLOGICAL UNIVERSITY
JNANA SANGAMA, BELGAVI-590018, KARNATAKA
DATA WAREHOUSING
(As per CBCS Scheme 2022)
PREPARED BY:
INDHUMATHI R (ASST.PROF DEPT OF DS (CSE), KNSIT)
MODULE: 4
CHAPTER 10: PRINCIPLES OF DIMENSIONAL MODELING
Before designing a data warehouse, you gather requirements to understand what data is
needed for analysis. This includes identifying business metrics (facts) and how users
want to analyze these metrics (dimensions).
E-R modeling is designed to manage operational data in OLTP systems, where the primary
focus is capturing and updating transaction-level detail efficiently while maintaining data
integrity.
Characteristics:
Advantages:
Data integrity and consistency: Strict normalization rules ensure consistent data.
Optimized for updates and short transactions: Quick response times for individual
transaction queries.
Dimensional modeling is tailored for data warehouses that focus on analyzing large datasets
for decision-making, often across multiple dimensions (e.g., customer, time, product).
Characteristics:
Advantages:
Easy to understand: Simple schema structure (fact and dimension tables) makes it
intuitive for business users.
Efficient for querying and reporting: Optimized for read-heavy operations involving
aggregations.
Drill-down and roll-up analysis: Enables users to explore data at different
granularities.
The STAR schema consists of a central fact table surrounded by multiple dimension tables.
It’s ideal for answering complex business questions by joining facts with dimensions.
Example Scenario:
In a manufacturing company:
Fact Table: Orders (contains measures like quantity sold, revenue, profit).
Dimension Tables: Customer, Product, Salesperson, Date.
Analysis Queries:
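The specific analysis queries are shown in a figure that is not reproduced here. As a rough illustration of how such questions are answered against a STAR schema, the sketch below (Python with the standard sqlite3 module) joins an assumed Orders fact table with assumed Product and Date dimension tables; all table and column names are illustrative assumptions, not the textbook's exact schema.

import sqlite3

# Illustrative STAR schema: one fact table joined to its dimension tables.
# Table and column names are assumptions for this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_dim (customer_key INTEGER PRIMARY KEY, customer_name TEXT, state TEXT);
CREATE TABLE product_dim  (product_key  INTEGER PRIMARY KEY, product_name  TEXT, category TEXT);
CREATE TABLE date_dim     (date_key     INTEGER PRIMARY KEY, full_date TEXT, month TEXT, year INTEGER);
CREATE TABLE orders_fact  (
    customer_key INTEGER, product_key INTEGER, date_key INTEGER,
    quantity_sold INTEGER, revenue REAL, profit REAL
);
""")

# A typical analysis query: revenue by product category and month,
# produced by joining the fact table with two dimension tables.
query = """
SELECT p.category, d.month, SUM(f.revenue) AS total_revenue
FROM   orders_fact f
JOIN   product_dim p ON f.product_key = p.product_key
JOIN   date_dim    d ON f.date_key    = d.date_key
GROUP BY p.category, d.month
ORDER BY p.category, d.month;
"""
for row in conn.execute(query):
    print(row)

The pattern is always the same: filter and group by dimension attributes, then aggregate the measures stored in the fact table.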
Dimension Table Characteristics
1. Primary Key:
o Each row in a dimension table is uniquely identified by a primary key. This
key links the dimension table to the fact table.
2. Wide Table:
o Dimension tables typically have many columns or attributes (sometimes more
than 50), making them wide. Each attribute provides descriptive details.
3. Textual Attributes:
o Dimension tables mostly contain textual data rather than numerical data used
in calculations. These attributes describe business entities, e.g., product names,
customer locations, etc.
4. Attributes Not Directly Related:
o Some attributes in a dimension table might not be directly related to one another.
For example, "product brand" and "package size" could both be part of the
product dimension, though they are unrelated.
5. Denormalized Structure:
o Dimension tables are not normalized to avoid complex joins and improve
query performance. This means redundant data is allowed to ensure faster
access.
6. Drilling Down and Rolling Up:
o Attributes allow for hierarchical analysis. For instance:
Drill down: State → City → Zip Code
Roll up: Zip Code → City → State
7. Multiple Hierarchies:
o Dimension tables can have multiple hierarchies depending on different
business needs. For example, in a product dimension:
Marketing hierarchy: Category → Department → Product
Finance hierarchy: Product → Finance Category → Finance Department
8. Fewer Records:
o Dimension tables generally contain fewer rows compared to fact tables. For
example, a product dimension table might have 500 rows, whereas the fact table
may have millions.
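As a concrete illustration of these characteristics, the following sketch defines a small, deliberately denormalized product dimension (Python with sqlite3); the attribute names are assumptions chosen for the example, not a prescribed design.

import sqlite3

conn = sqlite3.connect(":memory:")
# A deliberately denormalized, "wide" dimension table: one surrogate primary key,
# many textual attributes, and hierarchy levels (department > category > product)
# stored redundantly in the same row to avoid joins at query time.
conn.execute("""
CREATE TABLE product_dim (
    product_key   INTEGER PRIMARY KEY,   -- surrogate key linking to the fact table
    product_name  TEXT,
    brand         TEXT,
    package_size  TEXT,                  -- unrelated to brand, yet in the same table
    category      TEXT,                  -- hierarchy level used for roll-up
    department    TEXT                   -- higher hierarchy level
)
""")
conn.execute(
    "INSERT INTO product_dim VALUES (?, ?, ?, ?, ?, ?)",
    (1, "Widget-1", "Acme", "Box of 10", "Widgets", "Hardware"),
)
# Rolling up: browsing the dimension at a higher hierarchy level.
print(conn.execute("SELECT department, COUNT(*) FROM product_dim GROUP BY department").fetchall())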
Fact Table Characteristics
1. Concatenated Key
Explanation: A fact table uses a combination of primary keys from related dimension
tables to uniquely identify each row.
Example: Imagine an e-commerce platform where each transaction involves a product,
customer, and date. Each transaction in the fact table will have a unique combination
of product ID, customer ID, and date.
2. Data Grain
Explanation: It refers to the level of detail stored in the fact table. The lower the grain,
the more detailed the data.
Example: In a retail chain, grain could be at the daily transaction level (low grain) or
at a monthly sales summary level (high grain). Storing data at the daily level allows for
more detailed analysis like specific-day promotions.
4. Semi-Additive Measures
Explanation: Measures that can be summed along some dimensions but not along others.
Example: An account balance can be added across customers for a single day, but summing
the same balance across days is meaningless; such measures are usually averaged over time
instead.
5. Sparse Data
Explanation: Fact tables often have many possible combinations of dimension values,
but not all combinations will have data, leading to gaps.
Example: In a hotel booking system, there won’t be data for rooms not booked on
certain days, leading to sparse rows.
6. Table Deep, Not Wide
Explanation: Fact tables have many rows (large datasets) but fewer columns
(attributes).
Example: In a ticket booking system, each row represents a booking with details like
date, ticket ID, and seat, but there are fewer columns compared to dimension tables.
7. Degenerate Dimensions
Explanation: Attributes that are neither measures nor dimension-related but are
important for analysis, such as reference numbers.
Example: In an online order system, order_number and invoice_number are stored in
the fact table to track transactions without needing a separate dimension table.
8. Factless Fact Table
Explanation: Fact tables without measurable data, often used to track events or
activities.
Example: In a university, tracking student attendance doesn’t need a numeric fact. Each
row simply records that a student attended a specific class on a given day.
Data Granularity
Explanation: Fact tables at the lowest grain allow easy addition of new dimensions or
attributes without changing existing rows.
Example: In a sales system, adding a new region dimension won’t affect the fact table
because it already tracks sales at the individual salesperson level.
Explanation: Keeping data at a low grain increases detail but also requires more
storage.
Example: In a streaming platform, storing every view per user per video gives detailed
insights but consumes significant storage compared to just storing daily summary
views.
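The sketch below ties several of the characteristics above together: a concatenated primary key, a degenerate dimension (order_number), daily grain, and a roll-up to a coarser grain. All names and values are assumptions for illustration.

import sqlite3

conn = sqlite3.connect(":memory:")
# Fact table at daily-transaction grain. The primary key is the concatenation of
# the dimension keys; order_number is a degenerate dimension kept in the fact row.
conn.execute("""
CREATE TABLE sales_fact (
    date_key      INTEGER,
    product_key   INTEGER,
    customer_key  INTEGER,
    order_number  TEXT,      -- degenerate dimension: no separate dimension table
    quantity_sold INTEGER,
    revenue       REAL,
    PRIMARY KEY (date_key, product_key, customer_key, order_number)
)
""")
conn.execute("INSERT INTO sales_fact VALUES (20240105, 1, 42, 'ORD-001', 3, 29.97)")
conn.execute("INSERT INTO sales_fact VALUES (20240106, 1, 42, 'ORD-002', 1,  9.99)")

# Rolling the low-grain rows up to a monthly summary shows why a coarser grain
# stores fewer rows but also loses the daily detail.
monthly = conn.execute(
    "SELECT date_key / 100 AS month_key, SUM(revenue) FROM sales_fact GROUP BY month_key"
).fetchall()
print(monthly)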
Surrogate Keys
Explanation: Using operational system keys directly can lead to inconsistencies when
changes occur.
Example: In a warehouse system, if product codes are changed due to storage location
updates, using the operational key may lead to data mismatches. Instead, a generated
surrogate key avoids this issue.
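A minimal sketch of the surrogate-key idea in plain Python is given below; the key-generation scheme (a simple incrementing counter) is an assumption for illustration, and real warehouses typically assign surrogate keys during the ETL load.

# Map volatile operational (natural) keys to stable, system-generated surrogate keys.
# The warehouse stores only the surrogate key, so later changes to the operational
# product code do not break historical fact rows.
surrogate_map: dict[str, int] = {}
next_key = 1

def get_surrogate_key(operational_key: str) -> int:
    """Return the surrogate key for an operational key, assigning a new one if needed."""
    global next_key
    if operational_key not in surrogate_map:
        surrogate_map[operational_key] = next_key
        next_key += 1
    return surrogate_map[operational_key]

print(get_surrogate_key("PROD-A12"))   # 1
print(get_surrogate_key("PROD-B07"))   # 2
print(get_surrogate_key("PROD-A12"))   # 1 again: same product, same surrogate key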
In a STAR schema, each dimension table has a primary key that links to the fact
table through foreign keys.
This ensures a one-to-many relationship between dimension tables and the fact
table, enabling efficient data querying.
Advantages of the STAR Schema
1. User-Friendliness
o Unlike OLTP systems where users rely on predefined templates, data
warehouse users need to formulate their own queries.
o The STAR schema aligns with how users think:
Fact tables contain metrics (e.g., sales).
Dimension tables hold attributes (e.g., product, customer, time).
o This makes it intuitive for users to navigate and understand.
2. Optimized Navigation
o Simple Join Paths:
The STAR schema minimizes complex joins, making it easier and faster to
navigate.
o Example:
Analyzing defective parts in a GM automobile:
Filter by time (January 2009).
1. STARjoin
o A high-speed, parallelizable join that can process multiple tables in one
pass, improving query performance significantly.
2. STARindex
o A specialized index created on one or more foreign key columns of the fact table,
which speeds up the joins between the fact table and its dimension tables.
SLOWLY CHANGING DIMENSIONS
Type 1 Changes (Correction of Errors)
Scenario:
A customer's name is misspelled in the system as "Jonh Doe" instead of "John Doe."
Solution:
The incorrect name is simply overwritten with the correct one. No historical record of the
previous (incorrect) value is kept.
Type 2 Changes (Preservation of History)
Scenario:
The customer moves from New York to California, and the business needs to track orders by
state.
Solution:
A new row is added to the dimension table with the updated state, and each row is assigned a
unique surrogate key. This allows the platform to track which orders were placed from New
York and which were placed from California.
Before: the dimension row shows State: New York.
After: a new row with a new surrogate key shows State: California, while the original row is
retained.
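To make the type 2 mechanics concrete, here is a small illustrative Python sketch: the existing row is kept (marked as no longer current) and a new row with a fresh surrogate key is appended. Column names and the current-flag convention are assumptions for this example.

from datetime import date

# Customer dimension rows keyed by surrogate key; the natural key (customer_id)
# repeats across rows so that history is preserved (type 2 change).
customer_dim = [
    {"surrogate_key": 1, "customer_id": "C123", "name": "John Doe",
     "state": "New York", "effective_from": date(2020, 1, 1), "current": True},
]

def apply_type2_change(dim, customer_id, new_state, change_date):
    """Close the current row and append a copy with a new surrogate key and the new state."""
    current = next(r for r in dim if r["customer_id"] == customer_id and r["current"])
    current["current"] = False                      # keep the old row as history
    new_row = dict(current)
    new_row.update(
        surrogate_key=max(r["surrogate_key"] for r in dim) + 1,
        state=new_state, effective_from=change_date, current=True,
    )
    dim.append(new_row)

apply_type2_change(customer_dim, "C123", "California", date(2024, 6, 1))
# Old fact rows keep pointing at surrogate key 1 (New York); new fact rows use key 2.
for row in customer_dim:
    print(row["surrogate_key"], row["state"], row["current"])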
Type 3 Changes (Tentative Soft Revisions)
Scenario:
The customer’s loyalty tier changes from "Silver" to "Gold," and the business only needs to
track the current tier and the immediately previous tier.
Solution:
A new column is added for the previous loyalty tier. Both the current and previous tiers are
stored in the same row.
Before: the row stores only the current loyalty tier (Silver).
After: the row stores the current tier (Gold) together with the previous tier (Silver).
MISCELLANEOUS DIMENSIONS
1. Large Dimensions
Example: In a telecom company, the customer dimension can reach millions of rows,
containing attributes like customer ID, subscription type, location, and service preferences.
Searching across these attributes can be slow if not handled properly.
2. Multiple Hierarchies
Large dimensions often have multiple hierarchies, meaning different departments use different
sets of attributes to organize and analyze data.
Example: In a retail company, the marketing team might drill down products by category
(Electronics > Smartphones > Brand), while the finance team might group them by profitability
(High > Medium > Low).
3. Rapidly Changing Dimensions
These dimensions have attributes that change frequently, making it inefficient to keep creating
new rows using the type 2 approach.
Solution: move the rapidly changing attributes out of the main dimension table into one or
more separate mini-dimension tables.
Example: In e-commerce, customer behavior data such as purchase frequency or credit rating
changes rapidly. Instead of updating the main customer table, these attributes can be stored in
a behavior mini-dimension.
4. Junk Dimensions
Junk dimensions consolidate unrelated, often miscellaneous flags and textual attributes into a
single table instead of bloating the fact table or discarding them.
Example: A logistics company might track package statuses with various flags (e.g.,
"Delayed", "Fragile", "Priority"). Instead of spreading these across different tables, they can be
stored in a junk dimension for easier querying.
Snowflake Schema:
Definition: A more normalized version of the star schema, where dimension tables are
further broken down into related sub-dimensions.
Example: In e-commerce, instead of a single product table, there could be separate
tables for product categories, brands, and packages.
Advantage: Saves storage space and reduces data redundancy.
Disadvantage: Increases query complexity due to multiple table joins.
Partial Normalization: Only some dimension tables are normalized (e.g., splitting
product into brand and category).
Full Normalization: Every dimension is completely broken down (e.g., customer
dimension split into country, region, and city).
Example: In a telecom company, customer data might be broken into separate tables for
country, region, and city to avoid storing redundant geographic data.
Options to Normalize:
1. Partially Normalize: Example: Keep customer table intact but split product into brand
and category.
2. Fully Normalize a Few Tables: Example: Normalize only high-cardinality tables like
customer addresses.
3. Fully Normalize All Tables: Example: Turn all dimensions into multiple related
tables.
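The following sketch (Python with sqlite3, assumed names) shows what partial normalization of the product dimension looks like in practice: brand and category move into sub-dimension tables, and queries pick up extra joins.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Snowflaked (partially normalized) product dimension: low-cardinality attributes
-- are moved into their own sub-dimension tables and referenced by key.
CREATE TABLE brand_dim    (brand_key    INTEGER PRIMARY KEY, brand_name    TEXT);
CREATE TABLE category_dim (category_key INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE product_dim  (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT,
    brand_key    INTEGER REFERENCES brand_dim(brand_key),
    category_key INTEGER REFERENCES category_dim(category_key)
);
""")
# The trade-off: less redundancy in product_dim, but queries now need extra joins.
query = """
SELECT p.product_name, b.brand_name, c.category_name
FROM   product_dim p
JOIN   brand_dim    b ON p.brand_key    = b.brand_key
JOIN   category_dim c ON p.category_key = c.category_key;
"""
print(conn.execute(query).fetchall())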
When to Snowflake:
Reasons to Snowflake:
1. Storage Optimization:
For large datasets (millions of rows), snowflaking reduces redundancy, thus saving
significant storage space.
o Example: In a telecom company, demographic data like city classifications
can be split into a separate table from customer information to avoid
duplicating this data across millions of rows.
2. Granularity Differences:
Different attributes have different granularities and update cycles. Separating these
attributes allows better maintenance and reduces the impact of updates.
o Example: E-commerce platform updates product details frequently but
geographic data remains static. Separating these ensures efficient updates.
3. Frequent Attribute Browsing:
If certain attributes (e.g., demographics) are queried more frequently, placing them in
a subdimension improves access speed.
o Example: Marketing team in a retail company may often query demographic
data for customer segmentation.
Aggregate Fact Tables
An aggregate fact table is a precalculated summary table created by summarizing data from
the most granular (detailed) fact table. This helps improve query performance by reducing the
number of rows scanned during large queries.
Fact tables store metrics (e.g., sales amounts, quantities) at different levels of detail
(granularity). Queries manipulate these metrics to produce aggregated results.
1. Query 1:
"Total sales for customer number 12345678 during the first week of December 2008
for product Widget-1."
o Requires filtering rows by customer key, product key, and time key.
2. Query 2:
"Total sales for customer number 12345678 during the first three months of 2009 for
product Widget-1."
o Similar to Query 1 but involves a broader time range.
3. Query 3:
"Total sales for all customers in the south-central territory for the first two quarters
of 2009 for product category Bigtools."
o Requires summarization by territory and product category, while filtering by
time.
Running queries on large datasets can be time-consuming due to the need to scan numerous
fact table rows. Using aggregate tables improves query performance by precomputing
summary data.
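As an illustration of the idea, the sketch below builds a monthly aggregate table from a detailed fact table so that a query like Query 3 reads the small summary instead of scanning every detailed row. The schema is simplified and assumed (territory and category are kept directly in the fact table only to keep the example short; in a real design they would come from dimension tables).

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales_fact (          -- most granular level: one row per sale line
    date_key INTEGER, product_key INTEGER, customer_key INTEGER,
    territory TEXT, category TEXT, sales_amount REAL
);
-- Precomputed aggregate: sales summarized by territory, category and month.
CREATE TABLE sales_agg_month AS
SELECT date_key / 100 AS month_key,    -- e.g. 20090115 -> 200901
       territory, category,
       SUM(sales_amount) AS total_sales
FROM   sales_fact
GROUP BY month_key, territory, category;
""")
# Query 3 can now read the small aggregate table instead of scanning
# millions of detailed fact rows.
q3 = """
SELECT SUM(total_sales)
FROM   sales_agg_month
WHERE  territory = 'south-central' AND category = 'Bigtools'
AND    month_key BETWEEN 200901 AND 200906;
"""
print(conn.execute(q3).fetchone())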
DATA EXTRACTION, TRANSFORMATION, AND LOADING (ETL)
The ETL process refers to the extraction of data from various source systems, transformation
of that data into a format suitable for analysis, and loading the transformed data into a data
warehouse for storage and querying. ETL is crucial because the data from operational
systems is typically not structured for analytical purposes and needs to be processed and
optimized before storage in a data warehouse.
Key Challenges:
Time-consuming: The ETL process can take a long time, especially with large
volumes of data.
Arduous: Many data quality issues need to be resolved, and substantial computational
resources may be needed to handle the tasks.
This section covers the necessary steps and factors for successful ETL operations.
Key Factors:
Data quality: The data must be clean and accurate for meaningful analysis.
Scalability: ETL processes should be scalable to handle growing volumes of data.
Error handling: Proper mechanisms for identifying and resolving errors during ETL.
Source Identification: Identifying where the data will come from, whether from
operational systems or external sources.
Data Extraction: The process of retrieving data from various sources.
Data Transformation: This is where data is cleaned, formatted, and made consistent
for analysis.
Data Loading: The final step of loading the transformed data into the data
warehouse.
Data Extraction
Data extraction refers to the process of collecting data from various sources (like operational
systems, databases, or flat files) to load it into a data warehouse. The main challenge of data
extraction is to ensure that the data pulled is accurate, complete, and in the correct format for
further processing.
Source Identification:
Before extracting data, it’s essential to identify the correct source. This could be
transactional systems (like ERP or CRM), external sources (web data, APIs), or internal
databases.
The source systems may vary in terms of structure (relational databases, flat files,
NoSQL systems), and the data may be in different formats (XML, JSON, CSV, etc.).
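A tiny illustrative sketch of this point: two sources deliver customer records in different formats (CSV and JSON), and the extraction layer normalizes both into one in-memory representation. The field names are assumptions.

import csv, json, io

# Source systems deliver the same entity in different formats; the extraction layer
# normalizes each into a common record structure before transformation.
csv_source  = io.StringIO("customer_id,name\nC1,John Doe\n")
json_source = '[{"customer_id": "C2", "name": "Jane Roe"}]'

records = []
records.extend(dict(row) for row in csv.DictReader(csv_source))
records.extend(json.loads(json_source))
print(records)   # both sources now share one in-memory representation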
Data extraction involves pulling data from various source systems to populate a data
warehouse. This process is essential for consolidating data for analytics, but it also presents
significant challenges due to the diversity and complexity of source systems.
Static Data Capture: This technique captures the data at a fixed point in time,
essentially taking a snapshot of data. It is used primarily for the initial load or full
refresh of the data warehouse, such as when new products are added or a complete
refresh of a dimension table is necessary.
Incremental Data Capture: Unlike static capture, this involves extracting only the
changes (insertions, updates, deletions) since the last extraction. This method is crucial
for keeping the data warehouse up to date with operational systems without reloading
the entire dataset.
Transaction Log-Based Extraction: This technique captures changes to the database
by reading the transaction logs, enabling real-time data capture. It is typically used in
environments where up-to-the-minute data is required.
Database Trigger-Based Extraction: Database triggers are set to capture changes at
the point of data modification in the source system, often used when database-level
changes need to be monitored closely.
Capture Based on Date and Time Stamps: This approach relies on records having
time stamps that indicate when a record was created or updated. The data extraction
occurs after the fact, selecting records based on these time stamps. It works well when
data changes infrequently.
File Comparison: In cases where time stamps are unavailable or impractical, the data
is captured by comparing snapshots of files at different times. This technique compares
the current state of a dataset with its previous snapshot to identify changes. While
effective, it can be computationally expensive and inefficient, especially with large
datasets.
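The sketch below illustrates capture based on date and time stamps, assuming the source table carries a last_updated column: only rows changed after the previous extraction run are selected.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL, last_updated TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, 100.0, "2024-06-01T10:00:00"),
    (2, 250.0, "2024-06-03T09:30:00"),
])

def extract_incremental(conn, last_run: str):
    """Capture only rows created or updated after the previous extraction."""
    return conn.execute(
        "SELECT order_id, amount, last_updated FROM orders WHERE last_updated > ?",
        (last_run,),
    ).fetchall()

# Only order 2 changed since the last run on 2024-06-02.
print(extract_incremental(conn, "2024-06-02T00:00:00"))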
Different techniques for data extraction come with their advantages and disadvantages.
Deferred Extraction: This technique extracts data at scheduled times, usually during off-peak
hours, so that extraction does not interfere with the normal operation of the operational
systems. It generally has less impact on those systems and incurs less system overhead, but it
may miss intermediary changes that occur between extraction runs. It is especially useful when
dealing with large datasets.
Choosing the Right Method: The best data extraction technique depends on the operational
environment, the volume of data, and the business needs. Real-time methods are ideal for
environments that require immediate updates, while deferred methods can be more cost-
effective and less intrusive for less critical updates.
Data Transformation
Data transformation is the process of converting the extracted data into a format suitable for
analysis in the data warehouse. This step may involve cleaning, mapping, consolidating, or
reformatting the data.
Data Cleaning: Involves correcting errors, handling missing data, and resolving
inconsistencies.
Data Mapping: Ensures that data from different source systems is mapped correctly
into the warehouse schema.
Data Aggregation: Summarizing data, like calculating totals or averages, to reduce
volume and enhance query performance.
Data Integration: Merging data from multiple sources into a unified format.
Consolidation: Grouping data into a smaller set or changing its structure to be more
suitable for analysis (e.g., transforming transactional data into aggregated data).
Data Cleansing: Correcting issues like duplicate entries, data in the wrong format, or
outliers.
Implementation of Transformation:
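The implementation details are not reproduced here; as a rough, hedged sketch, the following Python fragment shows three of the transformation tasks listed above (cleansing of duplicates and missing values, mapping of inconsistent codes, and aggregation). Field names and rules are assumptions for this example.

from collections import defaultdict

# Raw extracted records with typical quality problems: inconsistent codes,
# missing values and duplicates.
raw = [
    {"cust": "C1", "state": "NY",        "amount": "100.50"},
    {"cust": "C1", "state": "NY",        "amount": "100.50"},   # duplicate
    {"cust": "C2", "state": "new york",  "amount": None},       # missing amount
]

STATE_MAP = {"NY": "New York", "new york": "New York"}          # mapping/standardization

def transform(records):
    cleaned, seen = [], set()
    for r in records:
        key = (r["cust"], r["state"], r["amount"])
        if key in seen:                       # cleansing: drop exact duplicates
            continue
        seen.add(key)
        cleaned.append({
            "cust": r["cust"],
            "state": STATE_MAP.get(r["state"], r["state"]),      # mapping
            "amount": float(r["amount"] or 0.0),                 # handle missing data
        })
    return cleaned

# Aggregation: summarize amounts per state after cleaning.
totals = defaultdict(float)
for row in transform(raw):
    totals[row["state"]] += row["amount"]
print(dict(totals))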
Data Loading
Data loading is the process of inserting the transformed data into the data warehouse. There are
different methods to perform data loading, depending on the type of load being executed
(initial, full refresh, or incremental).
1. Initial Load: The first load of data into the data warehouse, typically involving all the
data from source systems.
2. Incremental Loads: Regular updates that load only new or modified records since the
last extraction, making the process more efficient.
3. Full Refresh: The entire dataset is refreshed, which can be useful when the structure
of a table has changed or when old data needs to be completely replaced.
Load: Adds new data to the data warehouse without modifying existing records.
Append: Appends new records to existing data.
Destructive Merge: Replaces existing data with new records.
Constructive Merge: Combines new and existing records based on certain rules,
ensuring that both are preserved.
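A small illustrative sketch of applying incoming records in different modes (assumed schema): append-style insertion of new rows, followed by a destructive-merge-style upsert in which incoming records overwrite existing rows with the same key.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dw_sales (sale_id INTEGER PRIMARY KEY, amount REAL)")
conn.execute("INSERT INTO dw_sales VALUES (1, 100.0)")

incoming = [(1, 120.0), (2, 250.0)]   # one changed record, one brand-new record

# Append: add records that are new to the warehouse, leaving existing rows alone.
conn.executemany("INSERT OR IGNORE INTO dw_sales VALUES (?, ?)", incoming)

# Destructive merge: incoming records overwrite existing rows with the same key.
conn.executemany(
    "INSERT INTO dw_sales VALUES (?, ?) "
    "ON CONFLICT(sale_id) DO UPDATE SET amount = excluded.amount",
    incoming,
)
print(conn.execute("SELECT * FROM dw_sales ORDER BY sale_id").fetchall())
# -> [(1, 120.0), (2, 250.0)]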
Data Integration: This process combines data from different sources and formats into a
cohesive and usable format for decision-making. It often involves handling different data
structures, cleaning the data, and transforming it into a format that suits the warehouse’s needs.
Two approaches for data integration are:
The data extraction, transformation, and loading (ETL) process can be significantly aided by
specialized tools that help streamline and automate the process. These tools typically fall into
three broad categories:
Metadata is crucial for ETL because it documents all details about the data extraction,
transformation, and loading processes.
Metadata helps ensure that the data is accurately transformed, loaded, and queried. It
includes information about source and target systems, transformations applied to data,
and historical changes to data over time.
This section emphasizes the importance of a structured, well-planned approach to data
extraction, ensuring that data is consistently and accurately transferred from operational
systems to the data warehouse.
It also highlights the tools available to automate these processes and the role of
metadata in ensuring data integrity throughout the ETL process.