DWDM Unit-1 Notes PDF

UNIT-1

What is Data Warehousing?


Data Warehousing (DW) is a process for collecting and managing data from varied sources
to provide meaningful business insights. A data warehouse is typically used to connect and
analyse business data from heterogeneous sources. The data warehouse is the core of the BI
(Business Intelligence) system, which is built for data analysis and reporting.

How Does a Data Warehouse Work?


A Data Warehouse works as a central repository where information arrives from one or more
data sources. Data flows into a data warehouse from the transactional system and other
relational databases.
Data may be:
1. Structured
2. Semi-structured
3. Unstructured data
The data is processed, transformed, and ingested so that users can access the processed data
in the Data Warehouse through Business Intelligence tools, SQL clients, and spreadsheets. A
data warehouse merges information coming from different sources into one comprehensive
database.
By merging all of this information in one place, an organization can analyse its customers
more holistically. This helps to ensure that it has considered all the information available.
Data warehousing makes data mining possible. Data mining is looking for patterns in the data
that may lead to higher sales and profits.
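
The flow described above (data arrives from several sources, is transformed, and is then queried in one place) can be sketched in a few lines. This is a minimal illustration only: the two source lists, the table names, and the helper functions are hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical source extracts: an order system and a CRM export.
orders_source = [
    {"customer": "alice", "amount": "120.50"},
    {"customer": "BOB", "amount": "80"},
    {"customer": "Alice", "amount": "30"},
]
crm_source = [
    {"name": "Alice", "city": "Noida"},
    {"name": "Bob", "city": "Delhi"},
]


def transform_order(row):
    """Normalize customer names and convert amounts to numbers."""
    return (row["customer"].strip().title(), float(row["amount"]))


def load_warehouse(orders, customers):
    """Load cleaned rows from both sources into one in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (name TEXT PRIMARY KEY, city TEXT)")
    conn.execute("CREATE TABLE sales (customer TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO customer VALUES (?, ?)",
        [(c["name"], c["city"]) for c in customers],
    )
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [transform_order(o) for o in orders],
    )
    conn.commit()
    return conn


def sales_by_city(conn):
    """Answer an analytical question that needs both sources."""
    query = """
        SELECT c.city, SUM(s.amount)
        FROM sales AS s
        JOIN customer AS c ON c.name = s.customer
        GROUP BY c.city
        ORDER BY c.city
    """
    return conn.execute(query).fetchall()


warehouse = load_warehouse(orders_source, crm_source)
city_totals = sales_by_city(warehouse)
print(city_totals)
```

Neither source alone can answer "sales by city"; the question becomes answerable only once both are merged in one place.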

Types of Data Warehouse


Three main types of Data Warehouses (DWH) are:
1. Enterprise Data Warehouse (EDW):
Enterprise Data Warehouse (EDW) is a centralized warehouse. It provides decision support
service across the enterprise. It offers a unified approach for organizing and representing data.
It also provides the ability to classify data according to the subject and give access according
to those divisions.
2. Operational Data Store:
An Operational Data Store (ODS) is a data store that is needed when neither the data
warehouse nor the OLTP systems support an organization's reporting needs. In an ODS, the
data is refreshed in real time. Hence, it is widely preferred for routine activities
such as storing employee records.
3. Data Mart:
A data mart is a subset of the data warehouse. It is specially designed for a particular line
of business, such as sales or finance. In an independent data mart, data can be collected
directly from sources.
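
One way to picture a data mart as a subset of the warehouse is a view that exposes only one line of business. The sketch below assumes a hypothetical `warehouse_facts` table and a `sales_mart` view in SQLite; real data marts are often separate physical stores, so treat this as an illustration of the idea rather than a recommended design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical enterprise-wide table holding facts for every department.
conn.execute(
    "CREATE TABLE warehouse_facts (department TEXT, metric TEXT, value REAL)"
)
conn.executemany(
    "INSERT INTO warehouse_facts VALUES (?, ?, ?)",
    [
        ("sales", "revenue", 500.0),
        ("finance", "expenses", 320.0),
        ("sales", "units", 42.0),
    ],
)

# The data mart: only the rows that matter to the sales line of business.
conn.execute(
    "CREATE VIEW sales_mart AS "
    "SELECT metric, value FROM warehouse_facts WHERE department = 'sales'"
)

sales_rows = conn.execute(
    "SELECT metric, value FROM sales_mart ORDER BY metric"
).fetchall()
print(sales_rows)
```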

HEMANT KUMAR AI&DS IIMT GREATER NOIDA


General stages of Data Warehouse
Earlier, organizations started with relatively simple uses of data warehousing. Over time,
however, more sophisticated uses of data warehousing began.
The following are general stages of use of the data warehouse (DWH):
Offline Operational Database:
In this stage, data is just copied from an operational system to another server. In this way,
loading, processing, and reporting of the copied data do not impact the operational system’s
performance.
Offline Data Warehouse:
Data in the data warehouse is regularly updated from the operational database. The data is
mapped and transformed to meet the data warehouse objectives.
Real-time Data Warehouse:
In this stage, the data warehouse is updated whenever a transaction takes place in the
operational database, for example in an airline or railway booking system.
Integrated Data Warehouse:
In this stage, the data warehouse is updated continuously as the operational system
performs transactions. The data warehouse then generates transactions that are passed
back to the operational system.

Components of Data warehouse


Four components of Data Warehouses are:
Load manager:
The load manager is also called the front component. It performs all the operations
associated with the extraction and loading of data into the warehouse. These operations
include transformations to prepare the data for entering the data warehouse.
Warehouse Manager:
The warehouse manager performs operations associated with the management of the data in the
warehouse. These include analysis of data to ensure consistency, creation of indexes and
views, generation of denormalizations and aggregations, transformation and merging of
source data, and archiving and backing up data.
Query Manager:
The query manager is also known as the backend component. It performs all the operations
related to the management of user queries. This component directs queries to the
appropriate tables and schedules the execution of queries.
End-user access tools:
These are categorized into five groups: 1. Data reporting tools 2. Query tools 3.
Application development tools 4. EIS (executive information system) tools 5. OLAP tools
and data mining tools.

Building a Data Warehouse
Business Intelligence has advanced quickly and dramatically in recent years, and many
people are taking advantage of it. To be the most successful and efficient with this newfound
Business Intelligence (BI) power, it’s essential to be able to analyse and harness ALL of your
data. Enter the data warehouse.
Simply put, a data warehouse is a large store of data that is collected from multiple different
sources within a business. A data warehouse is used as storage for data analytic work (OLAP
systems), leaving the transactional database (OLTP systems) free to focus on transactions.
With a significant amount of data kept in one place, it’s now easier for businesses to analyse
and make better-informed decisions.
There are many ways to go about data warehousing. Our focus in this tutorial, however, is the
benefits of building one and the basic foundation required.

Why Should I Build a Data Warehouse?


While having all of your data gathered in one place is arguably the biggest benefit of having a
data warehouse, it is certainly not the only one. Here, we have listed some of the
other benefits of having a data warehouse:
Save Time – Business users can quickly access data from multiple sources within a data
warehouse, meaning that time will not be wasted on retrieving data from multiple sources.
Boost Confidence – Having data transferred automatically to your data warehouse by a
structured system, as opposed to being transferred by human labour, gives you more
confidence that your data is clean, current, and complete.
Increase Insight– Data warehouses structure your data so it is easily analysable.
Improve Security – Managing who has access to your data is much easier when there is
a centralized connection point. Data warehouses make security completely customizable, so
you are able to give access to whoever you would like and lock down all of your other
systems.

When using a data warehouse to its full potential, analysing data becomes convenient and
answering important questions about your business becomes simple. Your data is organized
and available so you can get your answers quickly and securely.
Now that you know why it is beneficial to have a data warehouse for your business, let us
talk about what it takes to build one.

Structure of a Data Warehouse

Regardless of the specific approach you take to building a data warehouse, there are three
components that should make up your basic structure: a storage mechanism, operational
software, and human resources.
Storage– This part of the structure is the main foundation— it is where your warehouse will
live. There are two main options when it comes to storage, an in-house server (Oracle,
Microsoft SQL Server) or on the cloud (Amazon S3, Microsoft Azure). An in-house server is
internal hardware that is set up within your office, and the cloud is a digital storage
solution based on external servers. Either is a feasible option when it comes to storage and all
depends on your needs.
Software – This is the operational part of the data warehouse structure. It is often broken
down into two categories: centralization software and visualization software. Centralization
software is needed to collect and maintain the data that comes from all of your separate
databases. Visualization software is needed to take the data and present it in a visual form to
aid in analysis. Some centralization software includes visualization software as part of its
package, but it is highly recommended that you have both types of software regardless.
Labor– This is the management aspect of the data warehouse, something that is absolutely
essential in having a working solution. To keep your warehouse functional, it might be
necessary to hire new positions within your business. Hiring well-skilled professionals is
crucial, as running a data warehouse requires a lot of knowledge. However, if you choose to
have a cloud-based warehouse, it might not be necessary to have as many human resources.
The cloud is managed by third-party vendors, so it is their responsibility to do routine
maintenance on hardware and servers.

Mapping the data warehouse architecture to Multiprocessor architecture


Database architectures for parallel processing
There are three DBMS software architecture styles for parallel processing:
Shared memory or shared everything architecture
Shared disk architecture
Shared nothing architecture

Shared Memory Architecture:


Tightly coupled shared memory systems, illustrated in the following figure, have the
following characteristics:
Multiple processing units (PUs) share memory.
Each PU has full access to all shared memory through a common bus.
Communication between nodes occurs via shared memory.
Performance is limited by the bandwidth of the memory bus.
It is simple to implement and provides a single system image, for example when an RDBMS
runs on an SMP (symmetric multiprocessor).

A disadvantage of shared memory systems for parallel processing is as follows:
Scalability is limited by bus bandwidth and latency, and by available memory.

Shared Disk Architecture
Shared disk systems are typically loosely coupled. Such systems, illustrated in following
figure, have the following characteristics:
Each node consists of one or more PUs and associated memory.
Memory is not shared between nodes.
Communication occurs over a common high-speed bus.
Each node has access to the same disks and other resources.
A node can be an SMP if the hardware supports it.
Bandwidth of the high-speed bus limits the number of nodes (scalability) of the system.
The Distributed Lock Manager (DLM) is required.
Parallel processing advantages of shared disk systems are as follows:
Shared disk systems permit high availability. All data is accessible even if one node dies.
These systems have the concept of one database, which is an advantage over shared nothing
systems.
Shared disk systems provide for incremental growth.
Parallel processing disadvantages of shared disk systems are these:
Inter-node synchronization is required, involving DLM overhead and greater dependency on
high-speed interconnect.
If the workload is not partitioned well, there may be high synchronization overhead.

Shared Nothing Architecture
Shared nothing systems are typically loosely coupled. In shared nothing systems only one
CPU is connected to a given disk. If a table or database is located on that disk, access
depends entirely on the PU that owns it.

Shared nothing systems are concerned with access to disks, not access to memory.
Adding more PUs and disks can improve scale up.
Shared nothing systems have advantages and disadvantages for parallel processing:

Advantages
Shared nothing systems provide for incremental growth.
System growth is practically unlimited.
MPPs are good for read-only databases and decision support applications.
Failure is local: if one node fails, the others stay up.

Disadvantages
More coordination is required.
More overhead is required for a process working on a disk belonging to another node.
If there is a heavy workload of updates or inserts, as in an online transaction processing
system, it may be worthwhile to consider data-dependent routing to alleviate contention.
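
Data-dependent routing in a shared nothing system can be pictured as a function that maps each key to the one node that owns it. The sketch below is a toy model under stated assumptions: the node count, the `owner_node` function, and the `SharedNothingCluster` class are hypothetical, and a CRC32 hash is just one possible partitioning rule.

```python
import zlib

NODE_COUNT = 3  # hypothetical cluster size


def owner_node(key, node_count=NODE_COUNT):
    """Map a partitioning key to the single node that owns its data."""
    return zlib.crc32(key.encode("utf-8")) % node_count


class SharedNothingCluster:
    """Toy model: every node has private storage that no other node reads."""

    def __init__(self, node_count=NODE_COUNT):
        self.nodes = [dict() for _ in range(node_count)]

    def insert(self, key, row):
        # Route the write straight to the owning node.
        self.nodes[owner_node(key, len(self.nodes))][key] = row

    def lookup(self, key):
        # Only the owning node is consulted; the others stay idle.
        return self.nodes[owner_node(key, len(self.nodes))].get(key)


cluster = SharedNothingCluster()
for customer_id in range(20):
    cluster.insert(f"cust-{customer_id}", {"id": customer_id})

print([len(node) for node in cluster.nodes])
print(cluster.lookup("cust-7"))
```

Because each key has exactly one owner, two nodes never contend for the same row, which is the contention relief that routing is meant to provide.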

Data Warehouse and Database: Differences

1. Data Warehouse: Data warehouse stores historic data. Data are stored based on Extract,
   Transform and Load processes.
   Database: The database stores real-time data. Data are collected based on application
   usage.

2. Data Warehouse: Data warehouse model types are: Virtual warehouse, Data Mart,
   Enterprise data warehouse.
   Database: Database model types are: Hierarchical database model, Relational model,
   Network model, Entity-relationship model, Document model, Star Schema, Object-oriented
   database model, Entity attribute value model.

3. Data Warehouse: The data warehouse is a central platform for data storage that helps
   businesses to collect and integrate data from various operational sources.
   Database: The database is a traditional method of storing data in tables, columns, and
   rows, which allows data queries and processing easily.

4. Data Warehouse: Data warehouses are used for analysing data and performance reporting.
   Database: Databases are used for capturing data, storing data, and supporting
   operational processes.

5. Data Warehouse: Data must be integrated and balanced from multiple processes.
   Database: Data is balanced within the scope of the process.

6. Data Warehouse: Data is updated in a scheduled manner.
   Database: Data is updated when a transaction occurs.

7. Data Warehouse: The data warehouse consumes data from all the databases and creates an
   optimized layer to perform data analytics.
   Database: The database is designed to be transactional and they are not designed to
   perform data analytics.

8. Data Warehouse: Schema is done or extracted from import.
   Database: Databases are structured with a defined schema.

Multi-Dimensional Data Model


The multi-dimensional data model is a method of organizing data in a database so that its
contents are well arranged and easy to assemble.
The multi-dimensional data model lets users ask analytical questions about market or
business trends, unlike relational databases, which let users access data in the form of
queries. Users receive answers to their requests rapidly because the data can be created
and examined comparatively fast.
OLAP (online analytical processing) and data warehousing use multi-dimensional
databases, which show multiple dimensions of the data to users.
The model represents data in the form of data cubes. Data cubes allow data to be modelled
and viewed from many dimensions and perspectives. A data cube is defined by dimensions and
facts and is represented by a fact table. Facts are numerical measures, and the fact table
contains the names of the facts, or measures, together with keys to each of the related
dimension tables.
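
To make the split between facts and dimensions concrete, here is a minimal sketch in plain Python. The dimension tables, the `sales_facts` rows, and the `total_by` helper are hypothetical names chosen for illustration; the branch and item values loosely follow the cube example used later in these notes.

```python
from collections import defaultdict

# Dimension tables: descriptive attributes, looked up by key.
branch_dim = {1: {"name": "A"}, 2: {"name": "B"}}
item_dim = {10: {"type": "computer"}, 11: {"type": "phone"}}

# Fact table: each row holds keys into the dimensions plus a numeric measure.
sales_facts = [
    {"branch_id": 1, "item_id": 10, "year": 1997, "units_sold": 5},
    {"branch_id": 1, "item_id": 11, "year": 1997, "units_sold": 3},
    {"branch_id": 2, "item_id": 10, "year": 1997, "units_sold": 7},
    {"branch_id": 2, "item_id": 11, "year": 1998, "units_sold": 2},
]


def total_by(facts, dimension_key, measure="units_sold"):
    """Sum one measure for every distinct value of one dimension key."""
    totals = defaultdict(int)
    for row in facts:
        totals[row[dimension_key]] += row[measure]
    return dict(totals)


# View the same facts from two different dimensions.
units_by_branch = {
    branch_dim[branch_id]["name"]: units
    for branch_id, units in total_by(sales_facts, "branch_id").items()
}
units_by_year = total_by(sales_facts, "year")

print(units_by_branch)
print(units_by_year)
```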

Features of multidimensional data models:


Measures: Measures are numerical data that can be analysed and compared, such as sales or
revenue. They are typically stored in fact tables in a multidimensional data model.
Dimensions: Dimensions are attributes that describe the measures, such as time, location,
or product. They are typically stored in dimension tables in a multidimensional data model.
Cubes: Cubes are structures that represent the multidimensional relationships between
measures and dimensions in a data model. They provide a fast and efficient way to retrieve
and analyse data.

Aggregation: Aggregation is the process of summarizing data across dimensions and levels
of detail. This is a key feature of multidimensional data models, as it enables users to quickly
analyse data at different levels of granularity.
Drill-down and roll-up: Drill-down is the process of moving from a higher-level summary
of data to a lower level of detail, while roll-up is the opposite process of moving from a
lower-level detail to a higher-level summary. These features enable users to explore data
in greater detail and gain insights into the underlying patterns.
Hierarchies: Hierarchies are a way of organizing dimensions into levels of detail. For
example, a time dimension might be organized into years, quarters, months, and days.
Hierarchies provide a way to navigate the data and perform drill-down and roll-up operations.
OLAP (Online Analytical Processing): OLAP is a type of multidimensional data model that
supports fast and efficient querying of large datasets. OLAP systems are designed to handle
complex queries and provide fast response times.
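
The interplay of hierarchies, aggregation, roll-up, and drill-down can be shown with a time dimension organized into years, quarters, months, and days. The daily figures and the `level_key` and `aggregate` helpers below are hypothetical; this is a sketch of the idea, not how any particular OLAP product implements it.

```python
from collections import defaultdict
from datetime import date

# Hypothetical daily sales: the lowest level of the time hierarchy.
daily_sales = {
    date(2023, 1, 15): 100,
    date(2023, 2, 10): 150,
    date(2023, 4, 5): 200,
    date(2024, 1, 20): 50,
}


def level_key(day, level):
    """Return the position of a day at one level of the time hierarchy."""
    if level == "year":
        return (day.year,)
    if level == "quarter":
        return (day.year, (day.month - 1) // 3 + 1)
    if level == "month":
        return (day.year, day.month)
    if level == "day":
        return (day.year, day.month, day.day)
    raise ValueError(f"unknown level: {level}")


def aggregate(sales, level):
    """Summarize the measure at the requested level of detail."""
    totals = defaultdict(int)
    for day, amount in sales.items():
        totals[level_key(day, level)] += amount
    return dict(totals)


by_year = aggregate(daily_sales, "year")        # roll-up to the top level
by_quarter = aggregate(daily_sales, "quarter")  # drill-down one level

print(by_year)
print(by_quarter)
```

Moving from `by_year` to `by_quarter` is a drill-down; moving the other way is a roll-up. Both are the same aggregation applied at a different level of the hierarchy.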

Advantages of Multi-Dimensional Data Model


The following are the advantages of a multi-dimensional data model:
A multi-dimensional data model is easy to handle.
It is easy to maintain.
Its performance is better than that of normal databases (e.g. relational databases).
The representation of data is better than traditional databases. That is because the multi-
dimensional databases are multi-viewed and carry different types of factors.
It is workable on complex systems and applications, contrary to the simple one-dimensional
database systems.
Its compatibility is a benefit for projects that have limited capacity for maintenance
staff.

Disadvantages of Multi-Dimensional Data Model


The following are the disadvantages of a Multi-Dimensional Data Model:
The multi-dimensional data model is somewhat complicated, so it requires trained
professionals to interpret and examine the data in the database.
While a multi-dimensional data model is in use, system caching has a large effect on how
the system performs.
Because of this complexity, the databases are generally dynamic in design.
The path to achieving the end product is complicated most of the time.
Because the multi-dimensional data model involves complicated systems made up of many
databases, the system is left very exposed when there is a security breach.

DATA CUBES: -
OLAP stands for Online Analytical Processing, which is a technology that enables multi-
dimensional analysis of business data. It provides interactive access to large amounts of data
and supports complex calculations and data aggregation. OLAP is used to support business
intelligence and decision-making processes.
Grouping of data in a multidimensional matrix is called a “data cube”. In data
warehousing, we generally deal with various multidimensional data models, as the data will
be represented by multiple dimensions and multiple attributes. This multidimensional data
is represented in the data cube, as the cube represents a high-dimensional space. The data
cube pictorially shows how different attributes of data are arranged in the data model.
Below is the diagram of a general data cube.

The example above is a 3D cube having attributes like branch (A, B, C, D), item
type (home, entertainment, computer, phone, security), and year (1997, 1998, 1999).

Data cube classification:


The data cube can be classified into two categories:
• Multidimensional data cube: It stores large amounts of data by making use of a
multi-dimensional array. It increases its efficiency by keeping an index of each
dimension. Thus, a multidimensional data cube is able to retrieve data fast.
• Relational data cube: It stores large amounts of data by making use of relational
tables. Each relational table displays the dimensions of the data cube. It is slower
compared to a multidimensional data cube.
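
The difference between the two storage styles can be sketched side by side: an array addressed through one index per dimension versus a list of relational rows that must be searched. The dimension values, the sample rows, and both lookup functions are hypothetical and deliberately tiny.

```python
branches = ["A", "B"]
items = ["computer", "phone"]
years = [1997, 1998]

# One index per dimension: maps a dimension value to its array position.
branch_index = {value: pos for pos, value in enumerate(branches)}
item_index = {value: pos for pos, value in enumerate(items)}
year_index = {value: pos for pos, value in enumerate(years)}

# Relational cube: one row per non-empty cell (branch, item, year, units).
relational_rows = [
    ("A", "computer", 1997, 5),
    ("A", "phone", 1998, 3),
    ("B", "computer", 1997, 7),
]

# Multidimensional cube: a dense 3-D array, empty cells hold zero.
cube = [[[0 for _ in years] for _ in items] for _ in branches]
for branch, item, year, units in relational_rows:
    cube[branch_index[branch]][item_index[item]][year_index[year]] = units


def array_lookup(branch, item, year):
    """Direct addressing through the per-dimension indexes."""
    return cube[branch_index[branch]][item_index[item]][year_index[year]]


def relational_lookup(branch, item, year):
    """Scan the rows until the matching cell is found."""
    for row in relational_rows:
        if row[:3] == (branch, item, year):
            return row[3]
    return 0


print(array_lookup("B", "computer", 1997))
print(relational_lookup("B", "computer", 1997))
```

The array version answers in a fixed number of steps but stores every empty cell, while the row version stores only what exists and pays for it with a search. Real relational engines narrow that gap with indexes.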

Data cube operations:

Data cube operations are used to manipulate data to meet the needs of users. These operations
help to select particular data for analysis. The five main operations are listed below:
• Roll-up: this operation aggregates similar data attributes that share the same
dimension. For example, if the data cube displays the daily income of a customer, we
can use a roll-up operation to find the customer's monthly income.

• Drill-down: this operation is the reverse of the roll-up operation. It allows us to take
particular information and then subdivide it further for finer granularity analysis. It
zooms into more detail. For example, if India is an attribute of a country column and
we wish to see villages in India, then the drill-down operation splits India into states,
districts, towns, cities, villages and then displays the required information.

• Slicing: this operation filters out the unnecessary portions. Suppose that in a particular
dimension the user does not need everything for analysis, but only a particular attribute.
For example, country="jamaica" displays only the data for Jamaica and does not
display the other countries present on the country list.

• Dicing: this operation does a multidimensional cut: it cuts not just one dimension but
can also go to another dimension and cut a certain range of it. As a result, it looks
more like a subcube out of the whole cube (as depicted in the figure). For example, the
user wants to see the annual salary of Jharkhand state employees.

• Pivot: this operation is very important from a viewing point of view. It basically
transforms the data cube in terms of view. It doesn’t change the data present in the
data cube. For example, if the user is comparing year versus branch, using the pivot
operation, the user can change the viewpoint and now compare branch versus item
type.
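
The five operations can be tried out on a very small cube stored as a dictionary keyed by (branch, item, year). Everything below is an illustrative sketch: the cell values and the function names `roll_up`, `slice_cube`, `dice`, and `pivot` are hypothetical, and drill-down is modelled simply as asking for a finer grouping than a previous roll-up.

```python
from collections import defaultdict

DIMENSIONS = ("branch", "item", "year")

# Hypothetical cube cells: (branch, item, year) -> units sold.
cube = {
    ("A", "computer", 1997): 5,
    ("A", "phone", 1997): 3,
    ("A", "computer", 1998): 4,
    ("B", "computer", 1997): 7,
    ("B", "phone", 1998): 2,
}


def roll_up(cells, keep):
    """Aggregate away every dimension that is not listed in keep."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for key, value in cells.items():
        totals[tuple(key[p] for p in positions)] += value
    return dict(totals)


def slice_cube(cells, dimension, value):
    """Fix one dimension to a single value."""
    position = DIMENSIONS.index(dimension)
    return {k: v for k, v in cells.items() if k[position] == value}


def dice(cells, allowed):
    """Keep cells whose values fall in the allowed set of each dimension."""
    checks = [(DIMENSIONS.index(d), set(vals)) for d, vals in allowed.items()]
    return {
        k: v for k, v in cells.items()
        if all(k[pos] in vals for pos, vals in checks)
    }


def pivot(cells, order):
    """Reorder the dimensions of every key; the values stay unchanged."""
    positions = [DIMENSIONS.index(name) for name in order]
    return {tuple(k[p] for p in positions): v for k, v in cells.items()}


by_branch = roll_up(cube, ["branch"])               # roll-up
by_branch_year = roll_up(cube, ["branch", "year"])  # drill-down from by_branch
only_1997 = slice_cube(cube, "year", 1997)          # slice
sub_cube = dice(cube, {"branch": ["A"], "year": [1997]})  # dice
year_first = pivot(cube, ("year", "branch", "item"))      # pivot

print(by_branch)
print(by_branch_year)
```

Note that `pivot` returns the same numbers under rearranged keys, which matches the point that pivoting changes the view and not the data.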
Advantages of data cubes:

• Multi-dimensional analysis: Data cubes enable multi-dimensional analysis of
business data, allowing users to view data from different perspectives and levels of
detail.
• Interactivity: Data cubes provide interactive access to large amounts of data,
allowing users to easily navigate and manipulate the data to support their analysis.
• Speed and efficiency: Data cubes are optimized for OLAP analysis, enabling fast and
efficient querying and aggregation of data.
• Data aggregation: Data cubes support complex calculations and data aggregation,
enabling users to quickly and easily summarize large amounts of data.
• Improved decision-making: Data cubes provide a clear and comprehensive view of
business data, enabling improved decision-making and business intelligence.
• Accessibility: Data cubes can be accessed from a variety of devices and platforms,
making it easy for users to access and analyze business data from anywhere.
• Helps in giving a summarised view of data.
• Data cubes store large data in a simple way.
• Data cube operation provides quick and better analysis.
• Improve performance of data.
Disadvantages of data cube:
• Complexity: OLAP systems can be complex to set up and maintain, requiring
specialized technical expertise.
• Data size limitations: OLAP systems can struggle with very large data sets and may
require extensive data aggregation or summarization.
• Performance issues: OLAP systems can be slow when dealing with large amounts of
data, especially when running complex queries or calculations.
• Data integrity: Inconsistent data definitions and data quality issues can affect the
accuracy of OLAP analysis.
• Cost: OLAP technology can be expensive, especially for enterprise-level solutions,
due to the need for specialized hardware and software.
• Inflexibility: OLAP systems may not easily accommodate changing business needs
and may require significant effort to modify or extend.

Snowflake Schema: The snowflake schema is a type of multidimensional model used for
data warehouses. A snowflake schema contains the fact table, dimension tables, and one
or more sub-dimension tables for each dimension table. It is a normalized form of the
star schema, which reduces redundancy and saves significant storage. Compared with the
fact constellation schema, it is easy to operate because it has fewer joins between the
tables, and simpler, less complex queries are used to access data from the database.

Advantages:
Reduced data redundancy: The snowflake schema reduces data redundancy by normalizing
dimensions into multiple tables, resulting in a more efficient use of storage space.
Improved performance: Because the normalized dimension tables are smaller, updates and
lookups on a single dimension table can be faster, although queries that span several
dimension levels need more joins than in a star schema.
Scalability: The snowflake schema is scalable, making it suitable for large data warehousing
projects with complex hierarchies.
Disadvantages:
Increased complexity: The snowflake schema can be more complex to implement and
maintain due to the additional tables needed for the normalized dimensions.
Reduced query performance: The increased complexity of the snowflake schema can result
in reduced query performance, particularly for queries that require data from multiple
dimensions.
Data integrity: Maintaining data integrity can be more difficult in a snowflake schema
because of the additional relationships between tables.
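
A snowflake layout and its join cost can be seen in a small SQLite sketch. The tables `category`, `product`, and `sales_fact` and their contents are hypothetical; the point is only that the product dimension is normalized into a sub-dimension, so a query by category needs two joins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Sub-dimension table: categories are stored once, not per product.
    CREATE TABLE category (
        category_id INTEGER PRIMARY KEY,
        name TEXT
    );
    -- Dimension table: points at its sub-dimension.
    CREATE TABLE product (
        product_id INTEGER PRIMARY KEY,
        name TEXT,
        category_id INTEGER REFERENCES category(category_id)
    );
    -- Fact table: dimension key plus a numeric measure.
    CREATE TABLE sales_fact (
        product_id INTEGER REFERENCES product(product_id),
        units INTEGER
    );
    INSERT INTO category VALUES (1, 'computer'), (2, 'phone');
    INSERT INTO product VALUES
        (10, 'laptop', 1), (11, 'desktop', 1), (12, 'handset', 2);
    INSERT INTO sales_fact VALUES (10, 5), (11, 3), (12, 4), (10, 2);
    """
)

# Reaching the category name takes two joins: fact -> product -> category.
units_by_category = conn.execute(
    """
    SELECT c.name, SUM(f.units)
    FROM sales_fact AS f
    JOIN product AS p ON p.product_id = f.product_id
    JOIN category AS c ON c.category_id = p.category_id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()
print(units_by_category)
```

In a star schema the category name would sit directly in the product table, saving one join at the cost of repeating the name for every product.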

Fact Constellation Schema: The fact constellation schema is also a type of
multidimensional model. It consists of dimension tables that are shared by several fact
tables, so it contains more than one star schema at a time. Unlike the snowflake schema,
the fact constellation schema is not easy to operate, as it has multiple joins between
tables. Also unlike the snowflake schema, the fact constellation schema uses heavily
complex queries to access data from the database.
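
The defining feature, several fact tables sharing the same dimension tables, can be sketched in SQLite as follows. The `product`, `sales_fact`, and `shipping_fact` tables and their rows are hypothetical and kept to a single shared dimension for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- One dimension table shared by both fact tables.
    CREATE TABLE product (
        product_id INTEGER PRIMARY KEY,
        name TEXT
    );
    CREATE TABLE sales_fact (
        product_id INTEGER REFERENCES product(product_id),
        units_sold INTEGER
    );
    CREATE TABLE shipping_fact (
        product_id INTEGER REFERENCES product(product_id),
        units_shipped INTEGER
    );
    INSERT INTO product VALUES (1, 'laptop'), (2, 'phone');
    INSERT INTO sales_fact VALUES (1, 5), (1, 2), (2, 4);
    INSERT INTO shipping_fact VALUES (1, 6), (2, 3), (2, 1);
    """
)

# Each fact table is summarized separately, then lined up on the shared
# dimension; joining both fact tables directly would multiply rows.
sold_vs_shipped = conn.execute(
    """
    SELECT p.name,
           (SELECT COALESCE(SUM(s.units_sold), 0)
              FROM sales_fact AS s
             WHERE s.product_id = p.product_id),
           (SELECT COALESCE(SUM(h.units_shipped), 0)
              FROM shipping_fact AS h
             WHERE h.product_id = p.product_id)
    FROM product AS p
    ORDER BY p.name
    """
).fetchall()
print(sold_vs_shipped)
```

The query is already more involved than a single-star query, which gives a feel for why the notes describe this schema as harder to operate.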

Let’s see the difference between Snowflake Schema and Fact Constellation Schema:

S.NO 1.
Snowflake Schema: Snowflake schema contains the large central fact table, dimension
tables and sub dimension tables.
Fact Constellation: While in fact constellation schema, dimension tables are shared by
many fact tables.

S.NO 2.
Snowflake Schema: Snowflake schema saves significant storage.
Fact Constellation: While fact constellation schema does not save storage.

S.NO 3.
Snowflake Schema: The snowflake schema consists of one star schema at a time.
Fact Constellation: Whereas the fact constellation schema consists of more than one star
schema at a time.

S.NO 4.
Snowflake Schema: In snowflake schema, tables can be maintained easily.
Fact Constellation: In fact constellation schema, the tables are tough to maintain.

S.NO 5.
Snowflake Schema: Snowflake schema is a normalized form of star schema.
Fact Constellation: While fact constellation schema is a normalized form of snowflake
schema and star schema.

S.NO 6.
Snowflake Schema: Snowflake schema is easy to operate as compared to fact constellation
schema as it has less number of joins between the tables.
Fact Constellation: Fact constellation schema is not easy to operate as compared to
snowflake schema as it has multiple number of joins between the tables.

S.NO 7.
Snowflake Schema: In snowflake schema, to access the data from database simple and less
complex query is used.
Fact Constellation: While in fact constellation schema, to access the data from database
heavier complex query is used.

Fact Constellation Schema:


Advantages:
Simple to understand: The fact constellation schema is easy to understand and maintain, as
it consists of multiple fact tables and multiple dimension tables.
Improved query performance: The fact constellation schema can improve query
performance by reducing the number of joins required to retrieve data from the fact table.

Flexibility: The fact constellation schema is flexible, allowing for the addition of new
dimensions without affecting the existing schema.

Disadvantages:
Increased data redundancy: The fact constellation schema can result in increased data
redundancy due to repeated dimension data across multiple fact tables.
Storage space: The fact constellation schema may require more storage space than the
snowflake schema due to the denormalized dimensions.
Limited scalability: The fact constellation schema may not be as scalable as the snowflake
schema for large data warehousing projects with complex hierarchies.
