DWDM

The document provides an overview of Data Warehousing and Data Mining, focusing on the structure, features, applications, and components of data warehouses. It discusses the importance of data warehouses for decision-making, the processes involved in building them, and the differences between database systems and data warehouses. Additionally, it covers various data models, including star and snowflake schemas, and outlines the architecture styles for parallel processing in data warehousing.

Uploaded by

21csme029hamdan


Allenhouse Institute of Technology (AKTU Code: 505)

Rooma, Kanpur – 208 008

Digital Notes

[Department of Computer Science & Engineering]

Subject Name : Data Warehousing and Data Mining

Subject Code : KAI075

Course : B.TECH

Branch : CS-AIML

Semester : 7th

Prepared by : Mr. Yogendra Singh

UNIT-I: DATA WAREHOUSING

Introduction to Data warehouse

A data warehouse is a collection of data marts holding historical data from the different
operations of a company. The data is stored in a structure optimized for querying and
analysis.
A data warehouse provides architectures and tools that help business executives systematically
organise, understand, and use their data to make strategic decisions.

The term "data warehouse" was coined by Bill Inmon in 1990. According to Bill Inmon:
"A warehouse is a subject-oriented, integrated, time-variant and non-volatile collection
of data in support of management's decision making process".

Features/Characteristics of Data Warehouse

A data warehouse has the following characteristics:
• Subject-oriented
• Integrated
• Time-variant
• Non-volatile

Subject-oriented: A data warehouse is organized around a particular subject area so that
the subject can be analyzed. For example, "sales" can be a particular subject.

Integrated: A data warehouse integrates data from multiple data sources. For example,
source A and source B may identify a product in different ways, but in the data warehouse
there is only a single way of identifying a product.

Time-variant: All data in the data warehouse is identified with a particular time period.
For example, one can retrieve data from the last 3 months, 6 months, 12 months, or even
older data from a data warehouse.

Non-volatile: Data in a data warehouse is stable. Once loaded, existing data is not
changed or removed when new data is added.

Note: A data warehouse does not require transaction processing, recovery, and concurrency
controls, because it is stored physically separate from the operational database.
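
The "Integrated" characteristic above can be illustrated with a minimal Python sketch. The source records, the field names (`sku`, `item_no`), the key mappings, and the warehouse key `P001` are all hypothetical and chosen only for illustration; a real warehouse would hold such mappings in its staging area.

```python
# Hypothetical source records: each source identifies the same product differently.
source_a = [{"sku": "A-1001", "qty": 5}]
source_b = [{"item_no": 77, "units": 3}]

# Assumed mapping tables from each source's identifier to one warehouse product key.
a_to_key = {"A-1001": "P001"}
b_to_key = {77: "P001"}


def integrate(rows_a, rows_b):
    """Return rows that all use the single warehouse product key."""
    rows = []
    for r in rows_a:
        rows.append({"product_key": a_to_key[r["sku"]],
                     "quantity": r["qty"], "source": "A"})
    for r in rows_b:
        rows.append({"product_key": b_to_key[r["item_no"]],
                     "quantity": r["units"], "source": "B"})
    return rows


warehouse_rows = integrate(source_a, source_b)
```

After integration, both rows carry the same `product_key`, so the product can be analyzed as one item regardless of which source reported it.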

Data Warehouse Applications

Data warehouses are widely used in the following fields:

• Consumer goods
• Banking services
• Financial services
• Manufacturing
• Retail sectors

Goals of Data Warehousing:

• To support reporting as well as analysis
• To maintain the organization's historical data
• To be the foundation for decision-making

Need of Data warehouse:

A data warehouse is needed for the following reasons:
• Business users
• Storing historical data
• Making strategic decisions
• Fast response time for queries

Data Warehouse Components:

The major components of a data warehouse are as follows –

1. Data warehouse database: This is the central part of the data warehousing
environment. The database is implemented on RDBMS technology.
2. Data source: Source data can be grouped into four categories:
• Production data: comes from the operational systems of the enterprise.
• Internal data: private spreadsheets, documents, customer profiles, etc.
• Archived data: old data that has been archived.
• External data: data obtained from sources outside the enterprise.

3. Data staging: After data has been extracted from the various operational systems and
external sources, it has to be prepared for storage in the data warehouse. The
extracted data, coming from several different sources, needs to be changed, converted,
and made ready in a format suitable for querying and analysis.

In data staging, three functions are performed, known together as ETL.

Extraction: Data coming from different data sources is extracted in different ways; the
technique differs for each data source.

Transformation: Data transformation includes data cleaning, such as providing default
values for missing data elements and eliminating duplicates, as well as sorting and
merging data coming from different sources. When the data transformation function ends,
we have a collection of integrated data that is cleaned, standardised, and summarised.

Loading: After the design and construction of the data warehouse are complete and it goes
live for the first time, the initial data is loaded into the data warehouse storage.
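
The three ETL functions can be sketched in Python as follows. This is only an illustrative outline: the source names (`billing`, `web_shop`), the record fields, and the default value `UNKNOWN` are assumptions, and the "warehouse" is a plain list rather than a real database.

```python
def extract(sources):
    """Extraction: pull raw rows from every source (here, in-memory lists)."""
    rows = []
    for name, records in sources.items():
        for rec in records:
            rows.append({**rec, "source": name})
    return rows


def transform(rows, default_region="UNKNOWN"):
    """Transformation: fill missing values, drop duplicates, sort and merge."""
    cleaned, seen = [], set()
    for row in rows:
        region = row.get("region") or default_region  # default for missing data
        key = (row["order_id"], row["amount"], region)
        if key in seen:                               # eliminate duplicates
            continue
        seen.add(key)
        cleaned.append({"order_id": row["order_id"],
                        "amount": row["amount"], "region": region})
    return sorted(cleaned, key=lambda r: r["order_id"])


def load(rows, warehouse):
    """Loading: append the prepared rows to the warehouse storage."""
    warehouse.extend(rows)
    return len(rows)


# Hypothetical source systems with one duplicate order and two missing regions.
sources = {
    "billing": [{"order_id": 2, "amount": 40.0, "region": "North"},
                {"order_id": 1, "amount": 25.0, "region": None}],
    "web_shop": [{"order_id": 2, "amount": 40.0, "region": "North"},
                 {"order_id": 3, "amount": 10.0}],
}
warehouse = []
loaded = load(transform(extract(sources)), warehouse)
```

Here the duplicate order from the second source is dropped and the two rows without a region receive the default value before loading.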

Metadata:
It is data about data. It stores the location and types of all the data in the data
warehouse, and it is used for maintaining, managing, and using the data warehouse. It is
classified into two types:
1. Technical metadata: It contains information about the data warehouse data that is used
by warehouse designers and administrators to carry out development and
management tasks. It includes:
• Information about data stores
• Transaction descriptions
2. Business metadata: It contains information that gives users an understandable view of
the information stored in the data warehouse.
4. Access tools
Their purpose is to provide information to business users for decision making. There are
five categories:
• Data query and reporting tools
• Application development tools
• Executive information system (EIS) tools
• Online Analytical Processing (OLAP) tools
• Data mining tools

5. Data marts: A data mart is a subset of the large amount of data present in the system.
Each subset is specific to a particular group of users and is related to a particular subject.

6. Data warehouse administration and management

The management of a data warehouse includes:

• Security and priority management
• Monitoring updates from multiple sources
• Data quality checks
• Auditing and reporting data warehouse usage and status

7. Information delivery system
• It enables the process of subscribing to data warehouse information.
• It delivers the information to one or more destinations according to a specified
scheduling algorithm.

Building a Data Warehouse:-

Several things have to be considered when building a successful data warehouse:
1. Business considerations
2. Design considerations
3. Technical considerations
4. Implementation considerations

1. Business considerations:-
Organizations interested in developing a data warehouse can choose one of the
following two approaches:

1. Top-Down Approach (suggested by Bill Inmon): In the top-down approach, the data
warehouse is built first. The data marts are then created from the data warehouse.
2. Bottom-Up Approach (suggested by Ralph Kimball): In the bottom-up approach, the data
marts are created first and the data warehouse is then built from them.

2. Design considerations:- There are several points related to data warehouse design:
a. Data content: The data warehouse should not contain as much detail-level
data as the operational systems from which the data is sourced.
b. Metadata: It is data about data, that is, a description and context of the data. It
helps to organize, find, and understand data.
c. Data distribution: It becomes necessary to know how the data should be divided
across multiple servers and which users should get access to which type of data.
d. Tools: The tools provide facilities for defining the transformation and cleanup
rules, data movement, user queries, reporting, and data analysis.

3. Technical considerations: A number of technical issues have to be considered when
implementing and building a data warehouse system.
4. Implementation considerations: The implementation of a data warehouse requires the
integration of many products:
1. Access tools: Ranking, statistical analysis, time series analysis, artificial
intelligence, and information mapping are some examples of access tool
types.
2. Data extraction, cleanup, transformation, and migration.
3. Metadata: It is data about data.

Mapping the Data Warehouse to a Multiprocessor Architecture:-

There are three DBMS software architecture styles for parallel processing:

1. Shared memory or shared everything architecture

2. Shared disk architecture

3. Shared nothing architecture

1. Shared Memory Architecture

Multiple processors share the main memory space, as well as mass storage (e.g., hard disk
drives).
Tightly coupled shared memory systems have the following characteristics:
• Multiple processor units (PUs) share memory.
• Each PU has full access to all shared memory through a common bus.
• Communication between nodes occurs via shared memory.
• Performance is limited by the bandwidth of the memory bus.
• It is simple to implement and provides a single system image, for example an RDBMS
running on an SMP (symmetric multiprocessor) machine.

Disadvantage:-

Scalability is limited by bus bandwidth and latency, and by available memory.

Shared Disk Architecture

Shared disk systems are typically loosely coupled. Such systems, illustrated in the following
figure, have the following characteristics:

• Each node consists of one or more PUs and associated memory.
• Memory is not shared between nodes.
• Each PU has its own local memory.
• Communication occurs over a common high-speed bus.
• Each node has access to the same disks and other resources.
• A node can be an SMP if the hardware supports it.
• The bandwidth of the high-speed bus limits the number of nodes (scalability) of the
system.
• A Distributed Lock Manager (DLM) is required.

The cluster illustrated in the figure is composed of multiple nodes, each of which is
tightly coupled internally. The Distributed Lock Manager (DLM) is required. Examples of
loosely coupled systems are VAX clusters or Sun clusters.

Since memory is not shared among the nodes, each node has its own data cache. Cache
consistency must be maintained across the nodes, and a lock manager is needed to maintain
it. Additionally, instance locks using the DLM at the Oracle level must be
maintained to ensure that all nodes in the cluster see identical data.

There is additional overhead in maintaining the locks and ensuring that the data caches are
consistent. The performance impact depends on the hardware and software components,
such as the bandwidth of the high-speed bus through which the nodes communicate, and on
DLM performance.

Advantages
Shared disk systems permit high availability. All data is accessible even if one node
dies. These systems have the concept of one database, which is an advantage over
shared nothing systems.

Shared disk systems provide for incremental growth.

Disadvantages

Inter-node synchronization is required.

If the workload is not partitioned well, there may be high synchronization overhead.

Shared Nothing Architecture:-

Shared nothing systems are typically loosely coupled. In shared nothing systems only
one CPU is connected to a given disk. If a table or database is located on that disk,
access depends entirely on the PU which owns it. Shared nothing systems can be
represented as follows:

Figure: Shared Nothing Architecture

Shared nothing systems are concerned with access to disks, not access to memory.
Nonetheless, adding more PUs and disks can improve scale up. Oracle Parallel Server can
access the disks on a shared nothing system as long as the operating system provides
transparent disk access, but this access is expensive in terms of latency.

Shared nothing systems have advantages and disadvantages for parallel processing:

Advantages
• Shared nothing systems provide for incremental growth.
• System growth is practically unlimited.
• MPPs (massively parallel processing systems) are good for read-only databases and
decision support applications.
• Failure is local: if one node fails, the others stay up.

Disadvantages
• More coordination is required.
• More overhead is required for a process working on a disk belonging to
another node.
• If there is a heavy workload of updates or inserts, as in an
online transaction processing system, it may be worthwhile to consider
data-dependent routing to alleviate contention.
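
Data-dependent routing can be sketched as follows: each row is sent to the node that owns the partition for its key, so no process has to work on a disk belonging to another node. This is a minimal illustration, not a description of any particular product; the node names, the `customer_id` key, and the choice of a CRC32 hash are assumptions.

```python
import zlib

NODES = ["node0", "node1", "node2"]  # hypothetical node names


def owner_node(key, nodes=NODES):
    """Return the node that owns the partition holding this key."""
    digest = zlib.crc32(str(key).encode("utf-8"))  # stable hash of the key
    return nodes[digest % len(nodes)]


def route(rows, key_field, nodes=NODES):
    """Group rows into one batch per node, based on each row's key."""
    batches = {n: [] for n in nodes}
    for row in rows:
        batches[owner_node(row[key_field], nodes)].append(row)
    return batches


# Hypothetical insert workload keyed by customer_id.
rows = [{"customer_id": i, "amount": 10 * i} for i in range(1, 7)]
batches = route(rows, "customer_id")
```

Because the hash is deterministic, the same key always goes to the same node, so each insert or update touches only the disk owned by that node.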

Difference between Database System and Data Warehouse

S.no. | Database | Data Warehouse
1 | It involves day-to-day processing. | It involves historical processing of information.
2 | It is used to run the business. | It is used to analyze the business.
3 | It contains current data. | It contains historical data.
4 | It is based on the Entity-Relationship model. | It is based on the Star Schema, Snowflake Schema, and Fact Constellation Schema.
5 | Optimized for write operations. | Optimized for read operations.
6 | Performance is low for analysis queries. | High performance for analytical queries.

Multidimensional data model:

• The multidimensional data model stores data in the form of a data cube. A data cube
allows data to be viewed in multiple dimensions.
• Dimensions are entities with respect to which an organization wants to keep records.
• A multidimensional database helps to provide data-related answers to complex
business queries quickly and accurately.
• Data warehouses and Online Analytical Processing (OLAP) tools are based on a
multidimensional data model. OLAP in data warehousing enables users to view data
from different angles and dimensions.
• There are three types of schema for the multidimensional data model:
1. Star schema model
2. Snowflake schema model
3. Fact constellations

Define data cube.

A data cube allows data to be modeled and viewed in multiple dimensions. It is defined by
dimensions and facts.

What is a dimension table?

Dimensions are perspectives or entities with respect to which an organization wants to
keep records. Each dimension may have a table associated with it, called a dimension
table, which further describes the dimension.

Define facts.

A multidimensional data model is typically organized around a central theme, and the theme
is represented by a fact table. Facts are numerical measures. The fact table contains the
names of the facts and keys to each of the related dimension tables.
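
To make the terms cube, dimension, and fact concrete, here is a small Python sketch. The sample rows, the dimension names (`quarter`, `product`, `city`), and the measure `units_sold` are hypothetical; the point is only that the same facts can be viewed along any chosen combination of dimensions.

```python
from collections import defaultdict

# Hypothetical fact rows: quarter, product and city are dimensions;
# units_sold is the numerical measure (the fact).
facts = [
    {"quarter": "Q1", "product": "pen", "city": "Kanpur", "units_sold": 10},
    {"quarter": "Q1", "product": "pen", "city": "Delhi", "units_sold": 4},
    {"quarter": "Q1", "product": "book", "city": "Kanpur", "units_sold": 7},
    {"quarter": "Q2", "product": "pen", "city": "Kanpur", "units_sold": 6},
]


def aggregate(rows, dimensions, measure="units_sold"):
    """Sum the measure for every combination of the chosen dimensions."""
    totals = defaultdict(int)
    for row in rows:
        cell = tuple(row[d] for d in dimensions)  # one cell of the cube view
        totals[cell] += row[measure]
    return dict(totals)


by_quarter = aggregate(facts, ["quarter"])
by_product_city = aggregate(facts, ["product", "city"])
```

`by_quarter` views the data along the time dimension only, while `by_product_city` views the same facts along two dimensions at once.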

1. Star schema model:-

The most common modeling paradigm is the star schema. A star schema consists of data in
the form of facts and dimensions: the fact table is at the center of the star and the
dimension tables are the points of the star. In this paradigm the data warehouse contains:

1. A large central table (fact table) containing the bulk of the data, with no redundancy.
2. A set of smaller attendant tables (dimension tables), one for each dimension.

The schema graph resembles a starburst, with the dimension tables displayed in a radial
pattern around the central fact table. A star schema may have any number of dimension
tables, with a many-to-one relationship between the fact table and each dimension table.

Example: Suppose a STAR schema is composed of a fact table, SALES, and a number of
dimension tables connected to it for time, product, and geographic location.

Figure: STAR Schema
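
The SALES example can be sketched with Python's built-in `sqlite3` module. The table and column names (`dim_time`, `dim_product`, `dim_location`, `units_sold`) and the sample rows are assumptions made for illustration; only the overall shape, one fact table with keys to three dimension tables, comes from the notes.

```python
import sqlite3

star = sqlite3.connect(":memory:")
star.executescript("""
-- Dimension tables: one per dimension (names are hypothetical)
CREATE TABLE dim_time     (time_id INTEGER PRIMARY KEY, quarter TEXT);
CREATE TABLE dim_product  (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_location (location_id INTEGER PRIMARY KEY, city TEXT);

-- Central fact table: keys to each dimension plus a numerical measure
CREATE TABLE sales (
    time_id     INTEGER REFERENCES dim_time(time_id),
    product_id  INTEGER REFERENCES dim_product(product_id),
    location_id INTEGER REFERENCES dim_location(location_id),
    units_sold  INTEGER
);

INSERT INTO dim_time     VALUES (1, 'Q1'), (2, 'Q2');
INSERT INTO dim_product  VALUES (1, 'pen'), (2, 'book');
INSERT INTO dim_location VALUES (1, 'Kanpur'), (2, 'Delhi');
INSERT INTO sales VALUES (1, 1, 1, 10), (1, 1, 2, 4), (1, 2, 1, 7), (2, 1, 1, 6);
""")

# One join from the fact table to a dimension answers "units sold per quarter".
units_by_quarter = star.execute("""
    SELECT t.quarter, SUM(s.units_sold)
    FROM sales s
    JOIN dim_time t ON s.time_id = t.time_id
    GROUP BY t.quarter
    ORDER BY t.quarter
""").fetchall()
```

Each row of `sales` refers to exactly one row in every dimension table, which is the many-to-one relationship between the fact table and each dimension table.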

2. Snowflake schema model

It is an extension of the star schema. “Snowflaking” is a method of normalizing the dimension
tables in a STAR schema.

A snowflake schema further splits the dimension tables of a star schema into one or more
normalized tables, thereby reducing redundancy. A snowflake schema can have
any number of dimensions, and each dimension can have any number of levels.

Example:

Give the advantages and disadvantages of snowflake schema.
Advantage: Dimension tables are kept in normalized form, so they are easy to maintain
and save storage space.
Disadvantage: It reduces the effectiveness of browsing, since more joins are needed to
execute a query.
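
The extra join cost of a snowflake schema can be seen in a short `sqlite3` sketch. The location dimension is split here into a city table and a state table; the table names and sample data are hypothetical and only illustrate the normalization idea.

```python
import sqlite3

snow = sqlite3.connect(":memory:")
snow.executescript("""
-- The location dimension is normalized into two levels: city -> state
CREATE TABLE dim_state (state_id INTEGER PRIMARY KEY, state TEXT);
CREATE TABLE dim_city  (city_id INTEGER PRIMARY KEY, city TEXT,
                        state_id INTEGER REFERENCES dim_state(state_id));
CREATE TABLE sales     (city_id INTEGER REFERENCES dim_city(city_id),
                        units_sold INTEGER);

INSERT INTO dim_state VALUES (1, 'Uttar Pradesh'), (2, 'Delhi');
INSERT INTO dim_city  VALUES (1, 'Kanpur', 1), (2, 'Lucknow', 1), (3, 'New Delhi', 2);
INSERT INTO sales     VALUES (1, 10), (2, 5), (3, 4);
""")

# Reaching the state level needs two joins instead of one.
units_by_state = snow.execute("""
    SELECT st.state, SUM(s.units_sold)
    FROM sales s
    JOIN dim_city  c  ON s.city_id  = c.city_id
    JOIN dim_state st ON c.state_id = st.state_id
    GROUP BY st.state
    ORDER BY st.state
""").fetchall()
```

Each state name is stored once in `dim_state` instead of being repeated on every city row, which saves space, but the query pays for it with the additional join.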

Fact Constellations:
• A fact constellation can have multiple fact tables that share many dimension tables.
• This type of schema can be viewed as a collection of stars (or snowflakes) and hence is
called a galaxy schema.
• A fact constellation schema describes a logical structure of a data warehouse or data
mart. It can be designed with a collection of de-normalized fact tables and
shared, conformed dimension tables.
• The main disadvantage of the fact constellation schema is its more complicated design.
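
A fact constellation can be sketched in the same way: two fact tables that share one dimension table. The `sales` and `shipping` fact tables, the shared `dim_product` table, and the sample rows are hypothetical names used only to illustrate the structure.

```python
import sqlite3

galaxy = sqlite3.connect(":memory:")
galaxy.executescript("""
-- One shared dimension table
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);

-- Two fact tables that both reference the shared dimension
CREATE TABLE sales    (product_id INTEGER REFERENCES dim_product(product_id),
                       units_sold INTEGER);
CREATE TABLE shipping (product_id INTEGER REFERENCES dim_product(product_id),
                       units_shipped INTEGER);

INSERT INTO dim_product VALUES (1, 'pen'), (2, 'book');
INSERT INTO sales    VALUES (1, 10), (1, 6), (2, 7);
INSERT INTO shipping VALUES (1, 12), (2, 7);
""")

# The shared dimension lets facts from both tables be compared per product.
sold_vs_shipped = galaxy.execute("""
    SELECT p.name,
           (SELECT SUM(units_sold)    FROM sales s    WHERE s.product_id = p.product_id),
           (SELECT SUM(units_shipped) FROM shipping h WHERE h.product_id = p.product_id)
    FROM dim_product p
    ORDER BY p.name
""").fetchall()
```

Each fact table is aggregated separately and then lined up through the shared dimension, which avoids multiplying rows by joining the two fact tables directly.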

Difference between STAR schema and Fact constellation

S.no. | STAR schema | Fact constellation
1 | Each dimension is represented by only one dimension table. | Dimension tables are shared by multiple fact tables.
2 | It is simple to understand and easy to design. | It is more complex and hard to design.
3 | It does not use normalization. | It uses normalization.
4 | It saves space because there is a single fact table. | It does not save space because there are multiple fact tables.
