Data Warehouse Unit 4 CS3551

What is data modeling?

Data modeling is the process of creating a visual representation or a blueprint
that defines the information collection and management systems of any
organization. This blueprint or data model helps different stakeholders, like data
analysts, scientists, and engineers, to create a unified view of the organization’s
data. The model outlines what data the business collects, the relationship
between different datasets, and the methods that will be used to store and
analyze the data.
Why is data modeling important?
Organizations today collect a large amount of data from many different sources.
However, raw data is not enough. You need to analyze data for actionable
insights that can guide you to make profitable business decisions. Accurate data
analysis needs efficient data collection, storage, and processing. There are
several database technologies and data processing tools, and different datasets
require different tools for efficient analysis.
Data modeling gives you a chance to understand your data and make the right
technology choices to store and manage this data. In the same way an architect
designs a blueprint before constructing a house, business stakeholders design a
data model before they engineer database solutions for their organization.
Data modeling brings the following benefits:

• Reduces errors in database software development
• Facilitates speed and efficiency of database design and creation
• Creates consistency in data documentation and system design across the
organization
• Facilitates communication between data engineers and business intelligence
teams
What are the types of data models?
Data modeling typically begins by representing the data conceptually and then
representing it again in the context of the chosen technologies. Analysts and
stakeholders create several different types of data models during the data design
stage. The following are three main types of data models:

Conceptual data model


Conceptual data models give a big picture view of data. They explain the
following:

• What data the system contains
• Data attributes and conditions or constraints on the data
• What business rules the data relates to
• How the data is best organized
• Security and data integrity requirements
The business stakeholders and analysts typically create the conceptual model. It
is a simple diagrammatic representation that does not follow formal data
modeling rules. What matters is that it helps both technical and nontechnical
stakeholders to share a common vision and agree on the purpose, scope, and
design of their data project.

Example of conceptual data models

For example, the conceptual data model for an auto dealership might show the
data entities like this:

1. A Showrooms entity that represents information about the different outlets the
dealership has
2. A Cars entity that represents the cars the dealership currently stocks
3. A Customers entity that represents all the customers who have made a purchase
in the dealership
4. A Sales entity that represents the information about the actual sale
5. A Salesperson entity that represents the information about all the salespeople
who work for the dealership
This conceptual model would also include business requirements, such as the
following:

• Every car must belong to a specific showroom.
• Every sale must have at least one salesperson and one customer associated with
it.
• Every car must have a brand name and product number.
• Every customer must provide their phone number and email address.
Conceptual models thus act as a bridge between the business rules and the
underlying physical database management system (DBMS). Conceptual data
models are also called domain models.

Logical data model

Logical data models map the conceptual data classes to technical data
structures. They give more details about the data concepts and complex data
relationships that were identified in the conceptual data model, such as these:

• Data types of the various attributes (for example, string or number)
• Relationships between the data entities
• Primary attributes or key fields in the data
Data architects and analysts work together to create the logical model. They
follow one of several formal data modeling systems to create the representation.
Sometimes agile teams might choose to skip this step and move from
conceptual to physical models directly. However, these models are useful for
designing large databases, called data warehouses, and for designing automatic
reporting systems.

Example of logical data models

In our auto dealership example, the logical data model would expand the
conceptual model and take a deeper look at the data classes as follows:

• The Showrooms entity has fields such as name and location as text data and a
phone number as numerical data.
• The Customers entity has an email address field that must follow a standard
email address format. The name field can be no more than 100 characters long.
• The Sales entity has a customer’s name and a salesperson’s name as fields,
along with the date of sale as a date data type and the amount as a decimal data
type.
Logical models thus act as a bridge between the conceptual data model and the
underlying technology and database language that developers use to create the
database. However, they are technology agnostic, and you can implement them
in any database language. Data engineers and stakeholders typically make
technology decisions after they have created a logical data model.

Physical data model


Physical data models map the logical data models to a specific DBMS
technology and use the software’s terminology. For example, they give details
about the following:

• Data field types as represented in the DBMS
• Data relationships as represented in the DBMS
• Additional details, such as performance tuning
Data engineers create the physical model before final design implementation.
They also follow formal data modeling techniques to make sure that they have
covered all aspects of the design.

Example of physical data models

Suppose that the auto dealership decided to create a data archive in Amazon S3
Glacier Flexible Retrieval. Their physical data model describes the following
specifications:

• In Sales, the sale amount is a float data type, and the date of sale is a timestamp
data type.
• In Customers, the customer name is a string data type.
• In S3 Glacier Flexible Retrieval terminology, a vault is the container that
stores your archived data.
Your physical data model also includes additional details such as which AWS
Region you will create your vault in. The physical data model thus acts as a
bridge between the logical data model and the final technology implementation.
What are the types of data modeling techniques?
Data modeling techniques are the different methods that you can use to create
different data models. The approaches have evolved over time as the result of
innovations in database concepts and data governance. The following are the
main types of data modeling:

Hierarchical data modeling

In hierarchical data modeling, you can represent the relationships between the
various data elements in a tree-like format. Hierarchical data models represent
one-to-many relationships, with parents or root data classes mapping to several
children.
In the auto dealership example, the parent class Showrooms would have both
entities Cars and Salespeople as children because one showroom has several
cars and salespeople working in it.
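To make the tree structure concrete, here is a minimal Python sketch of the
dealership hierarchy, with Showrooms as the parent and Cars and Salespeople as
children. The names and data are illustrative only, not from any specific tool:

```python
# A hierarchical model as a tree: each parent maps to lists of children.
# The structure and values below are invented for illustration.
showroom_tree = {
    "Showroom: NY Showroom": {
        "Cars": ["Car C1 (brand XYZ)", "Car C2 (brand ABC)"],
        "Salespeople": ["Jane", "John"],
    }
}

def print_tree(node, indent=0):
    """Walk the tree and print each parent above its children."""
    for parent, children in node.items():
        print(" " * indent + parent)
        if isinstance(children, dict):
            print_tree(children, indent + 2)
        else:
            for child in children:
                print(" " * (indent + 2) + str(child))

print_tree(showroom_tree)
```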

Graph data modeling

Hierarchical data modeling has evolved over time into graph data modeling.
Graph data models represent data relationships that treat entities equally.
Entities can link to each other in one-to-many or many-to-many relationships
without any concept of parent or child.
For example, one showroom can have several salespeople, and one salesperson
can also work at several showrooms if their shifts vary by location.

Relational data modeling

Relational data modeling is a popular modeling approach that visualizes data
classes as tables. Different data tables join or link together by using keys that
represent the real-world entity relationship. You can use relational database
technology to store structured data, and a relational data model is a useful
method to represent your relational database structure.
For example, the auto dealership would have relational data models that
represent the Salespeople table and Cars table, as shown here:

Salesperson ID   Name
1                Jane
2                John

Car ID   Car Brand
C1       XYZ
C2       ABC

Salesperson ID and Car ID are primary keys that uniquely identify individual
real-world entities. In the showroom table, these primary keys act as foreign
keys that link the data segments.

Showroom ID   Showroom Name   Salesperson ID   Car ID
S1            NY Showroom     1                C1
In relational databases, the primary and foreign keys work together to show the
data relationship. The preceding table demonstrates that showrooms can have
salespeople and cars.
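As an illustration of how primary and foreign keys work together, the following
sketch builds the three tables with Python's built-in sqlite3 module and joins
them back together. The table and column names simply follow the running
example; a real implementation would differ:

```python
import sqlite3

# In-memory database; names follow the dealership example above.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only if enabled

conn.executescript("""
CREATE TABLE salespeople (salesperson_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE cars        (car_id TEXT PRIMARY KEY, car_brand TEXT);

-- The showroom table links the entities through foreign keys.
CREATE TABLE showrooms (
    showroom_id    TEXT,
    showroom_name  TEXT,
    salesperson_id INTEGER REFERENCES salespeople(salesperson_id),
    car_id         TEXT    REFERENCES cars(car_id)
);

INSERT INTO salespeople VALUES (1, 'Jane'), (2, 'John');
INSERT INTO cars        VALUES ('C1', 'XYZ'), ('C2', 'ABC');
INSERT INTO showrooms   VALUES ('S1', 'NY Showroom', 1, 'C1');
""")

-- is SQL comment syntax; below, joining on the keys reconstructs the
# real-world relationship between showroom, salesperson, and car.
for row in conn.execute("""
    SELECT s.showroom_name, p.name, c.car_brand
    FROM showrooms s
    JOIN salespeople p USING (salesperson_id)
    JOIN cars        c USING (car_id)
"""):
    print(row)  # ('NY Showroom', 'Jane', 'XYZ')
```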

Entity-relationship data modeling

Entity-relationship (ER) data modeling uses formal diagrams to represent the
relationships between entities in a database. Data architects use several ER
modeling tools to represent data.

Object-oriented data modeling

Object-oriented programming uses data structures called objects to store data.
These data objects are software abstractions of real-world entities. For example,
in an object-oriented data model, the auto dealership would have data objects
such as Customers with attributes like name, address, and phone number. You
would store the customer data so that every real-world customer is represented
as a customer data object.
Object-oriented data models overcome many of the limitations of relational data
models and are popular in multimedia databases.
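A minimal sketch of this idea in Python, using a dataclass to stand in for a
data object; the Customer attributes follow the example above, and the sample
values are invented:

```python
from dataclasses import dataclass

# Each real-world customer is represented as one data object.
@dataclass
class Customer:
    name: str
    address: str
    phone_number: str

c = Customer(name="Gaurav", address="12 Park St", phone_number="555-0100")
print(c.name)  # attributes describe the real-world entity
```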

Dimensional data modeling

Modern enterprise computing uses data warehouse technology to store large
quantities of data for analytics. You can use dimensional data modeling
for high-speed data storage and retrieval from a data warehouse. Dimensional
models use duplication or redundant data and prioritize performance over using
less space for data storage.
For example, in dimensional data models, the auto dealership has dimensions
such as Car, Showroom, and Time. The Car dimension has attributes like name
and brand, but the Showroom dimension has hierarchies like state, city, street
name, and showroom name.
What is the data modeling process?
The data modeling process follows a sequence of steps that you must perform
repetitively until you create a comprehensive data model. In any organization,
various stakeholders come together to create a complete data view. Although
the steps vary based on the type of data modeling, the following is a general
overview.

Step 1: Identify entities and their properties


Identify all the entities in your data model. Each entity should be logically
distinct from all other entities and can represent people, places, things, concepts,
or events. Each entity is distinct because it has one or more unique properties.
You can think of entities as nouns and attributes as adjectives in your data
model.

Step 2: Identify the relationships between entities

The relationships between the different entities are at the heart of data
modeling. Business rules initially define these relationships at a conceptual
level. You can think of relationships as the verbs in your data model. For
instance, the salesperson sells many cars, or the showroom employs many
salespeople.

Step 3: Identify the data modeling technique

After you conceptually understand your entities and their relationships, you can
determine the data modeling technique that best suits your use case. For
example, you might use relational data modeling for structured data but
dimensional data modeling for unstructured data.

Step 4: Optimize and iterate

You can optimize your data model further to suit your technology and
performance requirements. For example, if you plan to use Amazon Aurora and
a structured query language (SQL), you will put your entities directly into tables
and specify relationships by using foreign keys. By contrast, if you choose to
use Amazon DynamoDB, you will need to think about access patterns before
you model your table. Because DynamoDB prioritizes speed, you first
determine how you will access your data and then model your data in the form
it will be accessed.
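To illustrate the access-pattern-first mindset described above, here is a rough
sketch in plain Python. It only mimics the single-table key design commonly
used with DynamoDB (a partition key plus a sort key); it does not use the
DynamoDB API, and the key layout shown is one common convention, not the only
one:

```python
# Access-pattern-first modeling (illustrative only; not the DynamoDB API).
# The known query "all sales for a customer" drives the key design, so the
# data is stored in the shape it will be accessed.
items = [
    {"pk": "CUSTOMER#3", "sk": "PROFILE",      "name": "Gaurav"},
    {"pk": "CUSTOMER#3", "sk": "SALE#2024-01", "amount": 12000.0},
    {"pk": "CUSTOMER#3", "sk": "SALE#2024-02", "amount": 8500.0},
]

# "Query all sales for customer 3" reads one partition and needs no joins.
sales = [i for i in items
         if i["pk"] == "CUSTOMER#3" and i["sk"].startswith("SALE#")]
print(sales)
```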
You will typically revisit these steps repeatedly as your technology and
requirements change over time.

What is Dimensional Modeling?

Dimensional modeling represents data with cube operations, making the
logical data representation more suitable for OLAP data management.
The advantage of using this model is that we can store data in such a way that
it is easier to store and retrieve once it is in the data warehouse. The
dimensional model is the data model used by many OLAP systems.

Elements of Dimensional Data Model

Fact

It is a collection of associated data items, consisting of measures and context
data. It typically represents business items or business transactions.

Dimensions

It is a collection of data which describes one business dimension. Dimensions
set the contextual background for the facts, and they are the framework over
which OLAP is performed.

Measure

It is a numeric attribute of a fact, representing the performance or behavior of
the business relative to the dimensions.

Considering the relational context, there are two basic models which are used in
dimensional modeling:

o Star Model
o Snowflake Model

The star model is the underlying structure for a dimensional model. It has one
broad central table (fact table) and a set of smaller tables (dimensions) arranged
in a radial design around the primary table. The snowflake model is the
result of decomposing one or more of the dimensions.

Fact Table

Fact tables are used to store the facts or measures of the business. Facts are the
numeric data elements that are of interest to the company.

Characteristics of the Fact table

The fact table includes numerical values of what we measure. For example, a
fact value of 20 might mean that 20 widgets have been sold.
Each fact table includes the keys to associated dimension tables. These are
known as foreign keys in the fact table.

Fact tables typically include a small number of columns.

When it is compared to dimension tables, fact tables have a large number of
rows.

Dimension Table

Dimension tables establish the context of the facts. Dimensional tables store
fields that describe the facts.

Characteristics of the Dimension table

Dimension tables contain the details about the facts. That, for example,
enables business analysts to understand the data and their reports better.

The dimension tables include descriptive data about the numerical values in the
fact table. That is, they contain the attributes of the facts. For example, the
dimension tables for a marketing analysis function might include attributes such
as time, marketing region, and product type.

Since the record in a dimension table is denormalized, it usually has a large
number of columns. The dimension tables include significantly fewer rows of
information than the fact table.

The attributes in a dimension table are used as row and column headings in a
document or query results display.

Example: A store summary in a fact table can be viewed by city and state, an
item summary can be viewed by brand, color, etc., and customer information can
be viewed by name and address.
Fact Table

Time ID   Product ID   Customer ID   Units Sold
4         17           2             1
8         21           3             2
8         4            1             1
In this example, the Customer ID column in the fact table is a foreign key that
joins with the dimension table. By following the links, we can see that row 2 of
the fact table records the fact that customer 3, Gaurav, bought two items on day
8.
Dimension Table

Customer ID   Name      Gender   Income   Education   Region
1             Rohan     Male     2        3           4
2             Sandeep   Male     3        5           1
3             Gaurav    Male     1        7           3
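To show how following the foreign key recovers this information in practice,
here is a small sketch using the pandas library (assumed to be installed); the
data reproduces the two tables above:

```python
import pandas as pd

# The fact table and one dimension table from the example above.
fact_sales = pd.DataFrame({
    "time_id":     [4, 8, 8],
    "product_id":  [17, 21, 4],
    "customer_id": [2, 3, 1],
    "units_sold":  [1, 2, 1],
})
dim_customer = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name":        ["Rohan", "Sandeep", "Gaurav"],
})

# Joining on the foreign key recovers the descriptive context of each fact:
# the row with time_id 8 and units_sold 2 resolves to customer Gaurav.
report = fact_sales.merge(dim_customer, on="customer_id")
print(report[["time_id", "name", "units_sold"]])
```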

Hierarchy

A hierarchy is a directed tree whose nodes are dimensional attributes and whose
arcs model many-to-one associations between dimensional attributes. It
contains a dimension, positioned at the tree's root, and all of the dimensional
attributes that describe it.

Facts

Facts are the measurable data elements that represent the business metrics of
interest. For example, in a sales data warehouse, the facts might include sales
revenue, units sold, and profit margins. Each fact is associated with one or
more dimensions, creating a relationship between the fact and the descriptive
data.

Dimension

Dimensions are the descriptive data elements that are used to categorize or
classify the data. For example, in a sales data warehouse, the dimensions
might include product, customer, time, and location. Each dimension is made
up of a set of attributes that describe the dimension. For example, the product
dimension might include attributes such as product name, product category,
and product price.
Attributes

The characteristics of a dimension in data modeling are known as attributes.
These are used to filter, search facts, etc. For a dimension of location,
attributes can be State, Country, Zipcode, etc.

Fact Table

In a dimensional data model, the fact table is the central table that contains the
measures or metrics of interest, surrounded by the dimension tables that
describe the attributes of the measures. The dimension tables are related to the
fact table through foreign key relationships.

Dimension Table

The dimensions of a fact are described by dimension tables, which are joined
to the fact table by foreign keys. Dimension tables are simply denormalized
tables. A dimension can have one or more relationships.
Steps to Create Dimensional Data Modeling

Step-1: Identifying the business objective: The first step is to identify the
business objective. Sales, HR, Marketing, etc. are some examples of the needs
of the organization. Since this is the most important step of data modeling,
the selection of the business objective also depends on the quality of the data
available for the process.

Step-2: Identifying Granularity: Granularity is the lowest level of
information stored in the table. The grain describes the level of detail of
the business problem and its solution.

Step-3: Identifying Dimensions and their Attributes: Dimensions are
objects or things. Dimensions categorize and describe data warehouse facts
and measures in a way that supports meaningful answers to business questions.
A data warehouse organizes descriptive attributes as columns in dimension
tables. For example, the date dimension may contain data like year, month,
and weekday.

Step-4: Identifying the Fact: The measurable data is held by the fact table.
Most of the fact table rows are numerical values like price or cost per unit,
etc.
Step-5: Building of Schema: We implement the Dimension Model in this
step. A schema is a database structure. There are two popular schemas: Star
Schema and Snowflake Schema.

Dimensional Data Modeling Steps

Dimensional data modeling is a technique used in data warehousing to
organize and structure data in a way that makes it easy to analyze and
understand. In a dimensional data model, data is organized into dimensions
and facts.

Objectives of Dimensional Modeling

The purposes of dimensional modeling are:

1. To produce a database architecture that is easy for end clients to understand
and write queries against.
2. To maximize the efficiency of queries. It achieves these goals by
minimizing the number of tables and the relationships between them.

Advantages of Dimensional Modeling


Dimensional modeling is simple: Dimensional modeling methods make it
possible for warehouse designers to create database schemas that business
customers can easily grasp and comprehend.

Dimensional modeling promotes data quality: the schema design encourages
referential integrity between the fact and dimension tables.

Performance optimization is possible through aggregates: As the size of the
data warehouse increases, performance optimization develops into a pressing
concern. Customers who have to wait for hours to get a response to a query will
quickly become discouraged with the warehouse.

• Simplified Data Access: Dimensional data modeling enables users to
easily access data through simple queries, reducing the time and effort
required to retrieve and analyze data.
• Enhanced Query Performance: The simple structure of dimensional data
modeling allows for faster query performance, particularly when compared
to relational data models.
• Increased Flexibility: Dimensional data modeling allows for more flexible
data analysis, as users can quickly and easily explore relationships between
data.
• Improved Data Quality: Dimensional data modeling can improve data
quality by reducing redundancy and inconsistencies in the data. The star
schema enables warehouse administrators to enforce referential integrity checks
on the data warehouse: since the fact table key is a concatenation of the keys
of its associated dimensions, a fact record is loaded only if the corresponding
dimension records already exist.
• Easy to Understand: Dimensional data modeling uses simple, intuitive
structures that are easy to understand, even for non-technical users.
Disadvantages of Dimensional Data Modeling
• Limited Complexity: Dimensional data modeling may not be suitable for
very complex data relationships, as it relies on simple structures to organize
data.
• Limited Integration: Dimensional data modeling may not integrate well
with other data models, particularly those that rely on normalization
techniques.
• Limited Scalability: Dimensional data modeling may not be as scalable as
other data modeling techniques, particularly for very large datasets.
• Limited History Tracking: Dimensional data modeling may not be able to
track changes to historical data, as it typically focuses on current data.
Disadvantages of Dimensional Modeling
1. To maintain the integrity of facts and dimensions, loading the data
warehouse with records from various operational systems is
complicated.
2. It is difficult to modify the data warehouse operation if the organization
adopting the dimensional technique changes the method in which it does
business.


What is Multi-Dimensional Data Model?

A multidimensional model views data in the form of a data cube. A data cube
enables data to be modeled and viewed in multiple dimensions. It is defined by
dimensions and facts.
The dimensions are the perspectives or entities concerning which an
organization keeps records. For example, a shop may create a sales data
warehouse to keep records of the store's sales for the dimension time, item, and
location. These dimensions allow the store to keep track of things, for example,
monthly sales of items and the locations at which the items were sold. Each
dimension has a table related to it, called a dimensional table, which describes
the dimension further. For example, a dimensional table for an item may contain
the attributes item_name, brand, and type.

A multidimensional data model is organized around a central theme, for
example, sales. This theme is represented by a fact table. Facts are numerical
measures. The fact table contains the names of the facts or measures of the
related dimensional tables.

Consider the data of a shop for items sold per quarter in the city of Delhi. The
data is shown in the table. In this 2D representation, the sales for Delhi are
shown for the time dimension (organized in quarters) and the item dimension
(classified according to the types of items sold). The fact, or measure,
displayed is rupees_sold (in thousands).
Now, if we want to view the sales data with a third dimension, suppose the data
is considered according to time and item, as well as location, for the cities
Chennai, Kolkata, Mumbai, and Delhi. These 3D data are shown in the table. The
3D data of the table are represented as a series of 2D tables.

Conceptually, the same data may also be represented in the form of a 3D
data cube, as shown in the figure.
Working on a Multidimensional Data Model

A Multidimensional Data Model is built on the basis of pre-decided stages.
The following stages should be followed by every project for building a Multi
Dimensional Data Model:
Stage 1 : Assembling data from the client : In the first stage, correct data
is collected from the client. Mostly, software professionals make clear to the
client the range of data that can be handled with the selected technology and
collect the complete data in detail.
Stage 2 : Grouping different segments of the system : In the second stage,
all the data is recognized and classified into the respective sections it
belongs to, which also makes the model problem-free to apply step by step.
Stage 3 : Noticing the different proportions : The third stage is the basis
on which the design of the system rests. In this stage, the main factors are
recognized according to the user's point of view. These factors are also
known as "dimensions".
Stage 4 : Preparing the actual-time factors and their respective qualities :
In the fourth stage, the factors recognized in the previous step are used to
identify their related qualities. These qualities are also known as
"attributes" in the database.
Stage 5 : Finding the actuality of the factors listed previously and their
qualities : In the fifth stage, the facts are separated and distinguished from
the factors (dimensions) collected earlier. These facts play a significant
role in the arrangement of a Multi Dimensional Data Model.
Stage 6 : Building the schema to place the data, with respect to the
information collected from the steps above : In the sixth stage, a schema is
built on the basis of the data collected previously.
For Example :
1. Let us take the example of a firm. The revenue and cost of a firm can be
recognized on the basis of different factors such as the geographical location
of the firm's workplace, the products of the firm, the advertisements done,
and the time utilized to develop a product.
Example 1

2. Let us take the example of the data of a factory which sells products per
quarter in Bangalore. The data is represented in the table given below :

2D factory data
In the presentation given above, the factory's sales for Bangalore are shown
for the time dimension (organized into quarters) and the item dimension
(sorted according to the kind of item sold). The facts here are
represented in rupees (in thousands).
Now, if we wish to view the sales data in three dimensions, it is represented
in the diagram given below, where the 3D data is shown as a series of
two-dimensional tables. Let us consider the data according to item, time, and
location (like Kolkata, Delhi, Mumbai). Here is the table:

3D data representation as 2D

This data can be represented in the form of three dimensions conceptually,
which is shown in the image below:

3D data representation
Features of multidimensional data models:

Measures: Measures are numerical data that can be analyzed and compared,
such as sales or revenue. They are typically stored in fact tables in a
multidimensional data model.
Dimensions: Dimensions are attributes that describe the measures, such as
time, location, or product. They are typically stored in dimension tables in a
multidimensional data model.
Cubes: Cubes are structures that represent the multidimensional relationships
between measures and dimensions in a data model. They provide a fast and
efficient way to retrieve and analyze data.
Aggregation: Aggregation is the process of summarizing data across
dimensions and levels of detail. This is a key feature of multidimensional data
models, as it enables users to quickly analyze data at different levels of
granularity.
Drill-down and roll-up: Drill-down is the process of moving from a higher-
level summary of data to a lower level of detail, while roll-up is the opposite
process of moving from a lower-level detail to a higher-level summary. These
features enable users to explore data in greater detail and gain insights into the
underlying patterns.
Hierarchies: Hierarchies are a way of organizing dimensions into levels of
detail. For example, a time dimension might be organized into years, quarters,
months, and days. Hierarchies provide a way to navigate the data and perform
drill-down and roll-up operations.
OLAP (Online Analytical Processing): OLAP is a type of multidimensional
data model that supports fast and efficient querying of large datasets. OLAP
systems are designed to handle complex queries and provide fast response
times.
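The roll-up and drill-down ideas can be sketched with the pandas library
(assumed installed), using a small invented sales table with a year > quarter >
day time hierarchy:

```python
import pandas as pd

# Daily sales with a year > quarter > day time hierarchy (invented data).
sales = pd.DataFrame({
    "year":    [2024, 2024, 2024, 2024],
    "quarter": ["Q1", "Q1", "Q2", "Q2"],
    "day":     [5, 40, 95, 120],
    "amount":  [100, 150, 200, 250],
})

# Roll-up: aggregate away the finest level (day -> quarter).
print(sales.groupby(["year", "quarter"])["amount"].sum())

# Rolling up again moves from quarter to year.
print(sales.groupby("year")["amount"].sum())

# Drill-down is the reverse: re-introducing the day level of detail.
print(sales.groupby(["year", "quarter", "day"])["amount"].sum())
```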

Advantages of Multi Dimensional Data Model

The following are the advantages of a multi-dimensional data model:

• A multi-dimensional data model is easy to handle.
• It is easy to maintain.
• Its performance is better than that of normal databases (e.g. relational
databases).
• The representation of data is better than traditional databases. That is
because the multi-dimensional databases are multi-viewed and carry
different types of factors.
• It is workable on complex systems and applications, contrary to the simple
one-dimensional database systems.
• The compatibility of this type of database is a benefit for projects that
have limited bandwidth for maintenance staff.

Disadvantages of Multi Dimensional Data Model

The following are the disadvantages of a Multi Dimensional Data Model:

• The multi-dimensional data model is slightly complicated in nature and it
requires professionals to recognize and examine the data in the database.
requires professionals to recognize and examine the data in the database.
• If the system crashes while a Multi-Dimensional Data Model is being worked
on, the working of the system is greatly affected.
• It is complicated in nature due to which the databases are generally
dynamic in design.
• The path to achieving the end product is complicated most of the time.
• Because the Multi Dimensional Data Model involves complicated systems with
a large number of databases, the system is very vulnerable when there is a
security breach.

What is Data Cube?

When data is grouped or combined into multidimensional matrices, the result is
called a data cube. The data cube method has a few alternative names or variants,
such as "multidimensional databases," "materialized views," and "OLAP
(On-Line Analytical Processing)."

The general idea of this approach is to materialize certain expensive
computations that are frequently queried.

For example, a relation with the schema sales (part, supplier, customer, and
sale-price) can be materialized into a set of eight views as shown in fig,
where psc indicates a view consisting of aggregate function value (such as total-
sales) computed by grouping three attributes part, supplier, and
customer, p indicates a view composed of the corresponding aggregate function
values calculated by grouping part alone, etc.
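The eight views correspond to the 2^3 possible groupings of the three dimension
attributes. A rough sketch with pandas (assumed installed; the rows are
invented) enumerates all of them, from the psc grouping down to the grand
total:

```python
from itertools import combinations
import pandas as pd

# Invented rows for the sales(part, supplier, customer, sale_price) relation.
sales = pd.DataFrame({
    "part":       ["p1", "p1", "p2"],
    "supplier":   ["s1", "s2", "s1"],
    "customer":   ["c1", "c1", "c2"],
    "sale_price": [10.0, 20.0, 30.0],
})

dims = ["part", "supplier", "customer"]

# 2^3 = 8 groupings: psc, ps, pc, sc, p, s, c, and the empty grouping
# (the grand total over all rows).
for r in range(len(dims), -1, -1):
    for group in combinations(dims, r):
        if group:
            view = sales.groupby(list(group))["sale_price"].sum()
        else:
            view = sales["sale_price"].sum()  # apex: total over everything
        print(group, "->")
        print(view)
```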
A data cube is created from a subset of attributes in the database. Specific
attributes are chosen to be measure attributes, i.e., the attributes whose values
are of interest. Other attributes are selected as dimensions or functional
attributes. The measure attributes are aggregated according to the dimensions.

For example, XYZ may create a sales data warehouse to keep records of the
store's sales for the dimensions time, item, branch, and location. These
dimensions enable the store to keep track of things like monthly sales of items,
and the branches and locations at which the items were sold. Each dimension
may have a table identified with it, known as a dimensional table, which describes
the dimensions. For example, a dimension table for items may contain the
attributes item_name, brand, and type.

The data cube method is an interesting technique with many applications. Data
cubes could be sparse in many cases because not every cell in each dimension
may have corresponding data in the database.

Techniques should be developed to handle sparse cubes efficiently.

If a query contains constants at even lower levels than those provided in a data
cube, it is not clear how to make the best use of the precomputed results stored
in the data cube.

The model views data in the form of a data cube. OLAP tools are based on the
multidimensional data model. Data cubes usually model n-dimensional data.

A data cube enables data to be modeled and viewed in multiple dimensions. A
multidimensional data model is organized around a central theme, like sales and
transactions. A fact table represents this theme. Facts are numerical measures.
Thus, the fact table contains measures (such as Rs_sold) and keys to each of the
related dimensional tables.
Dimensions and facts together define a data cube. Facts are generally quantities,
which are used for analyzing the relationships between dimensions.

Example: In the 2-D representation, we will look at the All Electronics sales
data for items sold per quarter in the city of Vancouver. The measure displayed
is dollars sold (in thousands).

3-Dimensional Cuboids

Let us suppose we would like to view the sales data with a third dimension. For
example, suppose we would like to view the data according to time and item, as
well as location, for the cities Chicago, New York, Toronto, and Vancouver. The
measure displayed is dollars sold (in thousands). These 3-D data are shown in
the table. The 3-D data of the table are represented as a series of 2-D tables.
Conceptually, we may represent the same data in the form of 3-D data cubes, as
shown in fig:

Let us suppose that we would like to view our sales data with an additional
fourth dimension, such as a supplier.
In data warehousing, the data cubes are n-dimensional. The cuboid which holds
the lowest level of summarization is called a base cuboid.

For example, the 4-D cuboid in the figure is the base cuboid for the given time,
item, location, and supplier dimensions.

The figure shows a 4-D data cube representation of sales data, according to the
dimensions time, item, location, and supplier. The measure displayed is dollars
sold (in thousands).

The topmost 0-D cuboid, which holds the highest level of summarization, is
known as the apex cuboid. In this example, this is the total sales, or dollars sold,
summarized over all four dimensions.

The lattice of cuboids forms a data cube. The figure shows the lattice of cuboids
creating a 4-D data cube for the dimensions time, item, location, and supplier.
Each cuboid represents a different degree of summarization.
Data cube classification:
The data cube can be classified into two categories:
• Multidimensional data cube: It basically helps in storing large amounts of
data by making use of a multi-dimensional array. It increases its efficiency
by keeping an index of each dimension. Thus, it is able to
retrieve data quickly.
• Relational data cube: It basically helps in storing large amounts of data by
making use of relational tables. Each relational table displays the
dimensions of the data cube. It is slower compared to a Multidimensional
Data Cube.
Data cube operations:

Data cube operations are used to manipulate data to meet the needs of users.
These operations help to select particular data for analysis purposes. There
are mainly five operations, listed below:
• Roll-up: this operation aggregates similar data attributes along the
same dimension. For example, if the data cube displays the daily
income of a customer, we can use a roll-up operation to find the
customer's monthly income.

• Drill-down: this operation is the reverse of the roll-up operation. It allows
us to take particular information and then subdivide it further for finer
granularity analysis. It zooms into more detail. For example, if India is an
attribute of a country column and we wish to see villages in India, then the
drill-down operation splits India into states, districts, towns, cities, and
villages and then displays the required information.

• Slicing: this operation filters out the unnecessary portions. Suppose that in
a particular dimension the user doesn't need everything for analysis, but only
a particular attribute. For example, country="Jamaica" will display only the
data about Jamaica and will not display the other countries on the country
list.

• Dicing: this operation performs a multidimensional cut, which not only cuts
one dimension but can also go to another dimension and cut a certain range of
it. As a result, it produces something more like a subcube of the whole
cube (as depicted in the figure). For example, the user wants to see the
annual salary of Jharkhand state employees.
• Pivot: this operation is very important from a viewing point of view. It
basically transforms the data cube in terms of view. It doesn’t change the
data present in the data cube. For example, if the user is comparing year
versus branch, using the pivot operation, the user can change the viewpoint
and now compare branch versus item type.
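These operations can be sketched with pandas (assumed installed) on a small
invented cube; the filters and pivot below mirror the slice, dice, roll-up, and
pivot descriptions above:

```python
import pandas as pd

# A small invented cube flattened into rows.
cube = pd.DataFrame({
    "country": ["Jamaica", "Jamaica", "India", "India"],
    "year":    [2023, 2024, 2023, 2024],
    "branch":  ["B1", "B2", "B1", "B2"],
    "sales":   [100, 120, 300, 340],
})

# Slice: fix one dimension to a single value.
print(cube[cube["country"] == "Jamaica"])

# Dice: cut a range across more than one dimension (a subcube).
print(cube[(cube["country"] == "India") & (cube["year"] >= 2024)])

# Roll-up: aggregate away the branch level.
print(cube.groupby(["country", "year"])["sales"].sum())

# Pivot: rotate the view (year versus branch) without changing the data.
print(cube.pivot_table(index="year", columns="branch",
                       values="sales", aggfunc="sum"))
```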

Advantages of data cubes:

• Multi-dimensional analysis: Data cubes enable multi-dimensional analysis
of business data, allowing users to view data from different perspectives
and levels of detail.
• Interactivity: Data cubes provide interactive access to large amounts of
data, allowing users to easily navigate and manipulate the data to support
their analysis.
• Speed and efficiency: Data cubes are optimized for OLAP analysis,
enabling fast and efficient querying and aggregation of data.
• Data aggregation: Data cubes support complex calculations and data
aggregation, enabling users to quickly and easily summarize large amounts
of data.
• Improved decision-making: Data cubes provide a clear and
comprehensive view of business data, enabling improved decision-making
and business intelligence.
• Accessibility: Data cubes can be accessed from a variety of devices and
platforms, making it easy for users to access and analyze business data
from anywhere.
• Helps in giving a summarized view of data.
• Data cubes store large amounts of data in a simple way.
• Data cube operations provide quick and better analysis.
• Improves the performance of data queries.

Disadvantages of data cube:

• Complexity: OLAP systems can be complex to set up and maintain,
requiring specialized technical expertise.
• Data size limitations: OLAP systems can struggle with very large data sets
and may require extensive data aggregation or summarization.
• Performance issues: OLAP systems can be slow when dealing with large
amounts of data, especially when running complex queries or calculations.
• Data integrity: Inconsistent data definitions and data quality issues can
affect the accuracy of OLAP analysis.
• Cost: OLAP technology can be expensive, especially for enterprise-level
solutions, due to the need for specialized hardware and software.
• Inflexibility: OLAP systems may not easily accommodate changing
business needs and may require significant effort to modify or extend.
What is Star Schema?

A star schema is the elementary form of a dimensional model, in which data are
organized into facts and dimensions. A fact is an event that is counted or
measured, such as a sale or a login. A dimension includes reference data about
the fact, such as date, item, or customer.

A star schema is a relational schema whose design represents a
multidimensional data model. The star schema is the simplest data warehouse
schema. It is known as a star schema because the entity-relationship
diagram of this schema resembles a star, with points diverging from a central
table. The center of the schema consists of a large fact table, and the points
of the star are the dimension tables.

Fact Tables

A fact table is a table in a star schema that contains facts and is connected
to dimensions. A fact table has two types of columns: those that contain facts
and those that are foreign keys to the dimension tables. The primary key of the
fact table is generally a composite key that is made up of all of its foreign
keys.

A fact table might contain either detail-level facts or facts that have been
aggregated (fact tables that contain aggregated facts are often called
summary tables). A fact table generally contains facts at the same level of
aggregation.

Dimension Tables

A dimension is an architecture usually composed of one or more hierarchies that
categorize data. If a dimension has no hierarchies and levels, it is called
a flat dimension or list. The primary keys of each of the dimension tables are
part of the composite primary key of the fact table. Dimensional attributes help
to define the dimensional values. They are generally descriptive, textual values.
Dimension tables are usually smaller in size than fact tables.

Fact tables store data about sales, while dimension tables store data about the
geographic region (markets, cities), clients, products, times, and channels.

Characteristics of Star Schema

The star schema is highly suitable for data warehouse database design
because of the following features:

o It creates a denormalized database that can quickly provide query
responses.
o It provides a flexible design that can be changed easily or added to
throughout the development cycle, and as the database grows.
o It provides a parallel in design to how end-users typically think of and use
the data.
o It reduces the complexity of metadata for both developers and end-users.

Advantages of Star Schema

Star schemas are easy for end users and applications to understand and navigate.
With a well-designed schema, the customer can instantly analyze large,
multidimensional data sets.

The main advantages of star schemas in a decision-support environment are:


Query Performance

Because a star schema database has a limited number of tables and clear join
paths, queries run faster than they do against OLTP systems. Small single-table
queries, frequently of a dimension table, are almost instantaneous. Large join
queries that contain multiple tables take only seconds or minutes to run.

In a star schema database design, the dimensions are connected only through the
central fact table. When two dimension tables are used in a query, only one
join path, intersecting the fact table, exists between those two tables. This
design feature enforces accurate and consistent query results.

Load performance and administration

Structural simplicity also decreases the time required to load large batches of
records into a star schema database. By defining facts and dimensions and
separating them into different tables, the impact of a load operation is reduced.
Dimension tables can be populated once and occasionally refreshed. We can add
new facts regularly and selectively by appending records to a fact table.
Built-in referential integrity

A star schema has referential integrity built in when information is loaded.
Referential integrity is enforced because each record in a dimension table has a
unique primary key, and all keys in the fact table are legitimate foreign keys
drawn from the dimension tables. A record in the fact table that is not related
correctly to a dimension cannot be given the correct key value to be retrieved.

Easily Understood

A star schema is simple to understand and navigate, with dimensions joined
only through the fact table. These joins are more significant to the end user
because they represent the fundamental relationships between parts of the
underlying business. Customers can also browse dimension table attributes
before constructing a query.

Disadvantage of Star Schema

There are some conditions that cannot be met by star schemas. For example, the
relationship between a user and a bank account cannot be described as a star
schema because the relationship between them is many-to-many.

Example: Suppose a star schema is composed of a fact table, SALES, and
several dimension tables connected to it for time, branch, item, and geographic
locations.

The TIME table has columns for day, month, quarter, and year. The ITEM
table has columns for item_key, item_name, brand, type, and supplier_type.
The BRANCH table has columns for branch_key, branch_name, and
branch_type. The LOCATION table has columns of geographic data, including
street, city, state, and country.
In this scenario, the SALES table contains only four columns with IDs from the
dimension tables, TIME, ITEM, BRANCH, and LOCATION, instead of four
columns for time data, four columns for ITEM data, three columns for
BRANCH data, and four columns for LOCATION data. Thus, the size of the
fact table is significantly reduced. When we need to change an item, we need
only make a single change in the dimension table, instead of making many
changes in the fact table.
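A sketch of this star schema as SQL DDL, run through Python's built-in sqlite3
module, makes the fact-table/dimension-table split explicit. The column types
are assumptions; a production warehouse would use its own dialect:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: each holds all attributes of one dimension.
CREATE TABLE time     (time_key INTEGER PRIMARY KEY, day TEXT, month TEXT,
                       quarter TEXT, year INTEGER);
CREATE TABLE item     (item_key INTEGER PRIMARY KEY, item_name TEXT,
                       brand TEXT, type TEXT, supplier_type TEXT);
CREATE TABLE branch   (branch_key INTEGER PRIMARY KEY, branch_name TEXT,
                       branch_type TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, street TEXT,
                       city TEXT, state TEXT, country TEXT);

-- Central fact table: only foreign keys to the dimensions plus measures.
CREATE TABLE sales (
    time_key     INTEGER REFERENCES time(time_key),
    item_key     INTEGER REFERENCES item(item_key),
    branch_key   INTEGER REFERENCES branch(branch_key),
    location_key INTEGER REFERENCES location(location_key),
    rupees_sold  REAL,
    units_sold   INTEGER
);
""")

# A typical star query: one join per dimension, then aggregation.
rows = conn.execute("""
    SELECT l.city, t.quarter, SUM(s.rupees_sold)
    FROM sales s
    JOIN location l USING (location_key)
    JOIN time     t USING (time_key)
    GROUP BY l.city, t.quarter
""").fetchall()
print(rows)  # empty until fact rows are loaded
```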

We can create even more complex star schemas by normalizing a dimension
table into several tables. The normalized dimension table is called a Snowflake.

What is Snowflake Schema?


A snowflake schema is a variant of the star schema. "A schema is known as a
snowflake if one or more dimension tables do not connect directly to the fact
table but must join through other dimension tables."

The snowflake schema is an expansion of the star schema where each point of
the star explodes into more points. It is called a snowflake schema because the
diagram of the snowflake schema resembles a snowflake. Snowflaking is a method
of normalizing the dimension tables in a star schema. When we normalize
all the dimension tables entirely, the resultant structure resembles a snowflake
with the fact table in the middle.

Snowflaking is used to improve the performance of specific queries. The schema
is diagrammed with each fact surrounded by its associated dimensions, and those
dimensions are related to other dimensions, branching out into a snowflake
pattern.

The snowflake schema consists of one fact table which is linked to many
dimension tables, which can be linked to other dimension tables through a
many-to-one relationship. Tables in a snowflake schema are generally
normalized to the third normal form. Each dimension table represents exactly
one level in a hierarchy.

The following diagram shows a snowflake schema with two dimensions, each
having three levels. A snowflake schema can have any number of dimensions,
and each dimension can have any number of levels.
Example: Figure shows a snowflake schema with a Sales fact table, with Store,
Location, Time, Product, Line, and Family dimension tables. The Market
dimension has two dimension tables with Store as the primary dimension table,
and Location as the outrigger dimension table. The product dimension has three
dimension tables with Product as the primary dimension table, and the Line and
Family table are the outrigger dimension tables.

A star schema stores all attributes for a dimension in one denormalized table.
This needs more disk space than a more normalized snowflake schema.
Snowflaking normalizes the dimension by moving attributes with low
cardinality into separate dimension tables that relate to the core dimension table
by using foreign keys. Snowflaking for the sole purpose of minimizing disk
space is not recommended, because it can adversely impact query performance.

In a snowflake schema, tables are normalized to remove redundancy: dimension
tables are decomposed into multiple dimension tables.

Figure shows a simple STAR schema for sales in a manufacturing company.
The sales fact table includes quantity, price, and other relevant metrics.
SALESREP, CUSTOMER, PRODUCT, and TIME are the dimension tables.

The STAR schema for sales, as shown above, contains only five tables, whereas
the normalized version now extends to eleven tables. We will notice that in the
snowflake schema, the attributes with low cardinality in each original
dimension table are removed to form separate tables. These new tables are
connected back to the original dimension table through artificial keys.
A snowflake schema is designed for flexible querying across more complex
dimensions and relationships. It is suitable for many-to-many and one-to-many
relationships between dimension levels.
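A sketch of the snowflaked Product dimension from the example above (Product
linked to Line, and Line to Family), again using Python's sqlite3 module with
assumed column names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Snowflaked product dimension: low-cardinality attributes moved out
-- into their own tables and linked back through foreign keys.
CREATE TABLE family  (family_key INTEGER PRIMARY KEY, family_name TEXT);
CREATE TABLE line    (line_key INTEGER PRIMARY KEY, line_name TEXT,
                      family_key INTEGER REFERENCES family(family_key));
CREATE TABLE product (product_key INTEGER PRIMARY KEY, product_name TEXT,
                      line_key INTEGER REFERENCES line(line_key));

-- The fact table still references only the primary dimension table.
CREATE TABLE sales (product_key INTEGER REFERENCES product(product_key),
                    quantity INTEGER, price REAL);
""")

# Reaching the Family level now costs two extra joins compared to a star.
conn.execute("""
    SELECT f.family_name, SUM(s.quantity)
    FROM sales s
    JOIN product p USING (product_key)
    JOIN line    l USING (line_key)
    JOIN family  f USING (family_key)
    GROUP BY f.family_name
""")
```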

Advantage of Snowflake Schema


1. The primary advantage of the snowflake schema is the improvement in
query performance due to minimized disk storage requirements and
joining smaller lookup tables.
2. It provides greater scalability in the interrelationship between dimension
levels and components.
3. No redundancy, so it is easier to maintain.

Disadvantage of Snowflake Schema


1. The primary disadvantage of the snowflake schema is the additional
maintenance effort required due to the increased number of lookup
tables.
2. Queries are more complex and hence more difficult to understand.
3. More tables mean more joins, and therefore longer query execution times.
What Is a Snowflake Schema?

This particular kind of data warehouse schema is shaped like a snowflake. The
snowflake schema aims to normalize the star schema's denormalized data.
When the star schema's dimensions are intricate, highly structured, and have
numerous degrees of connection, and the child tables have several parent tables,
the snowflake structure emerges. Some of the star schema's common issues are
resolved by the snowflake schema.

The snowflake schema can be thought of as a "multi-dimensional" structure. A
snowflake schema's central component comprises Fact Tables that link the data
inside the Dimension Tables, which then radiate outward like the Star Schema.
The snowflake schema, on the other hand, divides the Dimension Tables into
several tables, resulting in a snowflake pattern. Up until they are fully
normalized, the Dimension Tables are split across multiple tables.

Characteristics of Snowflake Schema

The snowflake schema is characterized by a normalized data structure, with data
divided into smaller, more specialized tables that are related to each other
through foreign keys.

These are its main characteristics:

• Small disk space is required by the snowflake schema.
• A new dimension is simple to add to the schema.
• Performance is impacted because there are numerous tables.
• Two or even more sets of attributes that describe data at various grains make
up the dimension table.
• A single dimension table's sets of characteristics are filled in by various
source systems.
Now that we have a basic understanding of the snowflake schema, let's dive into
the specifics of the star schema and explore what sets it apart from other data
organization techniques.

What Is a Star Schema?

The star schema is the most straightforward method for arranging data in the
data warehouse. One or more Fact Tables that index a number of
Dimension Tables may be present in the star schema's central area. Dimension
Keys, Values, and Attributes are found in Dimension Tables, which are used to
define Dimensions.

The star schema's objective is to distinguish between the descriptive or
"DIMENSIONAL" data and the numerical "FACT" data that pertains to a
business.

The information displayed in a numerical format, such as cost, speed, weight,
and quantity, might be considered fact data. Along with numbers, dimensional
data can also contain non-numerical elements like colors, places, and the names
of salespeople and employees.

While the Dimension Data is contained inside the Dimension Tables, the Fact
Data is arranged within the Fact Tables. In a star schema, the Fact Tables are
the integrating points at the core of a star.

Characteristics of Star Schema

The star schema is characterized by a denormalized data structure, with all data
related to a particular subject stored in a single large table and connected to
smaller, dimensional tables through a single join.

These are some of the main characteristics of the star schema:

• Each dimension in a star schema is represented by a single dimension table.
• The collection of attributes should be in the dimension table.
• Using a foreign key, the dimension table is connected to the fact table.
• No connections are made between the dimension tables.
• Keys and measures are found in the fact table.
• The star schema offers the best possible disk use and is simple to grasp.
• Tables for the dimensions are not normalized. For example, the Country ID
in the image above does not have a Country lookup table, as an OLTP
architecture would have.
• BI Tools provide extensive support for the schema.
With a foundational understanding of the snowflake and star schema under our
belts, it's time to explore the key differences between the two.

Star Schema vs. Snowflake Schema

Definition and Meaning
• Star Schema: Both fact tables and dimension tables are present in a star schema.
• Snowflake Schema: Dimension tables, sub-dimension tables, and fact tables are all included in a snowflake schema.

Type of Model
• Star Schema: The star schema is a top-down type of model.
• Snowflake Schema: The snowflake schema is a bottom-up type of model.

Space
• Star Schema: Uses more space compared to the snowflake schema.
• Snowflake Schema: Uses less space comparatively.

Join Relations
• Star Schema: Relationships between tables are represented by a single join, resulting in a simple data structure for fast query performance and easy data analysis.
• Snowflake Schema: Has a complex data structure with multiple levels of relationships between tables, represented by multiple joins. This can make the data structure more difficult to understand and result in slower query performance.

Response Time for Queries
• Star Schema: Faster query execution times due to a single join between the fact table and its attributes in dimension tables.
• Snowflake Schema: Requires complex joins between tables, which can slow down query processing and impact other OLAP products.

Normalization
• Star Schema: Dimension tables are not organized in a normalized form. They are typically denormalized and contain multiple levels of information about a particular subject in a single table.
• Snowflake Schema: Dimension tables are normalized.

Design Complexity
• Star Schema: Has a simpler design compared to the snowflake schema.
• Snowflake Schema: More complex design compared to the star schema.

Query Complexity
• Star Schema: Simpler query design, because the fact table is joined to only one level of dimension tables.
• Snowflake Schema: More complex query design, due to the need for multiple joins between the fact table and its dimension tables. This leads to additional overhead in query writing.

Understanding Complexity
• Star Schema: Simpler to understand compared to the snowflake schema.
• Snowflake Schema: More complex to understand compared to the star schema.

Foreign Keys
• Star Schema: Has a smaller number of foreign keys.
• Snowflake Schema: Comparatively has more foreign keys.

Data Redundancy
• Star Schema: Stores redundant data in the dimension tables.
• Snowflake Schema: Fully normalizes the dimension tables and prevents data redundancy.

Advantages
• Star Schema: Simple and easy-to-understand data structure; fast query performance due to the single join between the fact table and its dimension tables; suitable for large volumes of data; good for ad-hoc querying and data analysis.
• Snowflake Schema: Normalized data structure reduces redundancy and increases data integrity; allows for more complex relationships between data; allows for easier data maintenance and management; good for more structured, predictable querying.

Disadvantages
• Star Schema: Limited ability to depict complex relationships between data; can suffer from data redundancy and decreased data integrity; may not be suitable for smaller volumes of data.
• Snowflake Schema: The more complex data structure can be harder to understand and work with; multiple joins between tables can result in slower query performance; requires more storage and processing resources due to the larger number of tables.
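To make the normalization difference concrete, below is a minimal sketch (assuming Python's built-in sqlite3 module as a stand-in database) that declares the same product dimension both ways; all table and column names are illustrative assumptions, not taken from the text.

import sqlite3

conn = sqlite3.connect(":memory:")

# Star schema: one denormalized dimension table; category attributes
# are repeated on every product row (redundant, but only one join away).
conn.execute("""
    CREATE TABLE dim_product_star (
        product_key   INTEGER PRIMARY KEY,
        product_name  TEXT,
        category_name TEXT,
        category_desc TEXT
    )""")

# Snowflake schema: category attributes are normalized out into a
# sub-dimension table, removing redundancy at the cost of an extra join.
conn.execute("""
    CREATE TABLE dim_category (
        category_key  INTEGER PRIMARY KEY,
        category_name TEXT,
        category_desc TEXT
    )""")
conn.execute("""
    CREATE TABLE dim_product_snow (
        product_key  INTEGER PRIMARY KEY,
        product_name TEXT,
        category_key INTEGER REFERENCES dim_category(category_key)
    )""")

Querying the star version needs only a single join from the fact table; the snowflake version needs a second join through dim_category.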

What is Fact Constellation Schema?

A fact constellation consists of two or more fact tables sharing one or more dimension tables. It is also called a galaxy schema.

The fact constellation schema describes the logical structure of a data warehouse or data mart. It can be designed with a collection of de-normalized fact tables and shared, conformed dimension tables.

The fact constellation schema is a sophisticated design in which it is difficult to summarize information. It can be implemented by combining aggregate fact tables or by decomposing a complex fact table into independent, simpler fact tables.

Example: Consider the following fact constellation schema.


This schema defines two fact tables, sales and shipping. Sales is analyzed along four dimensions, namely, time, item, branch, and location. The sales fact table includes keys to each of the four dimensions, along with two measures: Rupee_sold and units_sold. The shipping table has five dimensions (keys): item_key, time_key, shipper_key, from_location, and to_location, and two measures: Rupee_cost and units_shipped.
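As a rough sketch of the example above (again assuming sqlite3 as a stand-in warehouse; the dimension-table definitions are invented for illustration), the two fact tables share the conformed time and item dimensions:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
    CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);

    -- Sales fact table: four dimension keys and two measures.
    CREATE TABLE sales (
        time_key     INTEGER REFERENCES time_dim(time_key),
        item_key     INTEGER REFERENCES item_dim(item_key),
        branch_key   INTEGER,
        location_key INTEGER,
        Rupee_sold   REAL,
        units_sold   INTEGER
    );

    -- Shipping fact table: five dimension keys and two measures,
    -- sharing time_key and item_key with the sales fact table.
    CREATE TABLE shipping (
        item_key      INTEGER REFERENCES item_dim(item_key),
        time_key      INTEGER REFERENCES time_dim(time_key),
        shipper_key   INTEGER,
        from_location INTEGER,
        to_location   INTEGER,
        Rupee_cost    REAL,
        units_shipped INTEGER
    );
""")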

The primary disadvantage of the fact constellation schema is that it is a more challenging design, because many variants for specific kinds of aggregation must be considered and selected.

Data Warehouse Applications

The application areas of the data warehouse are:


Information Processing

It deals with querying, statistical analysis, and reporting via tables, charts, or graphs. Nowadays, information processing in a data warehouse is done by constructing low-cost, web-based access tools, typically integrated with web browsers.

Analytical Processing

It supports various online analytical processing operations such as drill-down, roll-up, and pivoting. Historical data is processed in both summarized and detailed form.

OLAP is implemented on data warehouses or data marts. The primary objective of OLAP is to support the ad-hoc querying needed by decision support systems (DSS). The multidimensional view of data is fundamental to OLAP applications. OLAP is an operational view, not a data structure or schema. The complex nature of OLAP applications requires a multidimensional view of the data.

Data Mining
It helps in the analysis of hidden patterns and associations, the construction of analytical models, classification and prediction, and the presentation of mining results using visualization tools.

Data mining is the technique of discovering meaningful new correlations, patterns, and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques.

It is the process of selecting, exploring, and modeling large quantities of data to discover regularities or relations that are at first unknown, in order to obtain precise and useful results for the owner of the database.

It is the process of inspection and analysis, by automatic or semi-automatic means, of large quantities of records to discover meaningful patterns and rules.

Data Warehouse Process Architecture

The process architecture defines an architecture in which the data from the data
warehouse is processed for a particular computation.

Following are the two fundamental process architectures:


Centralized Process Architecture

In this architecture, the data is collected into a single centralized store and processed by a single machine with large capacity in terms of memory, processors, and storage.

Centralized process architecture evolved with transaction processing and is well suited for small organizations with one location of service.

It requires minimal resources both from people and system perspectives.

It is very successful when the collection and consumption of data occur at the
same location.

Distributed Process Architecture

In this architecture, information and its processing are allocated across data centers; processing is localized to each data center, and the results are grouped into centralized storage. Distributed architectures are used to overcome the limitations of centralized process architectures, where all the information needs to be collected at one central location and results are available in one central location.

There are several distributed process architectures:

Client-Server

In this architecture, the client does all the information gathering and presentation, while the server does the processing and management of data.

Three-tier Architecture

In client-server architecture, the client machines must stay connected to a server machine, which mandates maintaining state and introduces latency and overhead in terms of the records carried between clients and servers. The three-tier architecture relieves this by isolating a middle tier, typically an application server, between the clients and the data server.

N-tier Architecture

The n-tier or multi-tier architecture is where clients, middleware, applications, and servers are isolated into tiers.

Cluster Architecture

In this architecture, machines are connected in a network (through software or hardware) to work together to process information or compute requirements in parallel. Each device in a cluster is assigned a function that is processed locally, and the result sets are collected by a master server, which returns them to the user.

Peer-to-Peer Architecture

This is a type of architecture where there are no dedicated servers and clients.
Instead, all the processing responsibilities are allocated among all machines,
called peers. Each machine can perform the function of a client or server or just
process data.

Types of Database Parallelism

Parallelism is used to support speedup, where queries are executed faster because more resources, such as processors and disks, are provided. Parallelism is also used to provide scale-up, where increasing workloads are managed without increased response time, via an increase in the degree of parallelism.

Different architectures for parallel database systems are shared-memory, shared-disk, shared-nothing, and hierarchical structures.

(a) Horizontal Parallelism: The database is partitioned across multiple disks, and parallel processing occurs within a specific task (e.g., a table scan) that is performed concurrently on different processors against different sets of data.

(b) Vertical Parallelism: It occurs among different tasks. All component query operations (e.g., scan, join, and sort) are executed in parallel in a pipelined fashion. In other words, the output of one operation (e.g., a join) is fed to the next operation as soon as records become available.
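A minimal sketch of horizontal parallelism, assuming the table has already been partitioned into plain Python lists: the same scan task runs concurrently on different processors against different sets of data.

from multiprocessing import Pool

# Three partitions of the same column, as if spread across three disks.
partitions = [
    [120, 80, 150],
    [90, 200, 40],
    [75, 310, 60],
]

def scan_partition(rows):
    # The same scan-and-filter task, applied to one partition only.
    return sum(r for r in rows if r > 100)

if __name__ == "__main__":
    with Pool(processes=len(partitions)) as pool:
        # One scan per partition, executed in parallel.
        partial_sums = pool.map(scan_partition, partitions)
    print(sum(partial_sums))  # combine the partial results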

Intraquery Parallelism

Intraquery parallelism defines the execution of a single query in parallel on multiple processors and disks. Using intraquery parallelism is essential for speeding up long-running queries.

Interquery parallelism does not help here, since each query is run sequentially.

To improve the situation, many DBMS vendors developed versions of their products that utilize intraquery parallelism.

This application of parallelism decomposes a serial SQL query into lower-level operations such as scan, join, sort, and aggregation.

These lower-level operations are then executed concurrently, in parallel.


Interquery Parallelism

In interquery parallelism, different queries or transactions execute in parallel with one another.

This form of parallelism can increase transaction throughput. The response times of individual transactions, however, are no faster than they would be if the transactions were run in isolation.

Thus, the primary use of interquery parallelism is to scale up a transaction processing system to support a greater number of transactions per second.

Database vendors started to take advantage of parallel hardware architectures by implementing multiserver and multithreaded systems designed to handle a large number of client requests efficiently.

This approach naturally resulted in interquery parallelism, in which different server threads (or processes) handle multiple requests at the same time.

Interquery parallelism has been successfully implemented on SMP systems, where it increased throughput and allowed the support of more concurrent users.
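A rough sketch of interquery parallelism (assuming a SQLite file named warehouse.db that already contains a sales table; both names are hypothetical): independent queries are handed to separate server threads, which raises throughput without making any single query faster.

import sqlite3
from concurrent.futures import ThreadPoolExecutor

def run_query(sql):
    # Each "server thread" opens its own connection and runs one
    # complete query; different queries execute in parallel.
    conn = sqlite3.connect("warehouse.db")
    try:
        return conn.execute(sql).fetchall()
    finally:
        conn.close()

queries = [
    "SELECT COUNT(*) FROM sales",
    "SELECT SUM(units_sold) FROM sales",
    "SELECT MAX(Rupee_sold) FROM sales",
]

with ThreadPoolExecutor(max_workers=len(queries)) as pool:
    results = list(pool.map(run_query, queries))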

Shared Disk Architecture

Shared-disk architecture implements a concept of shared ownership of the entire database between RDBMS servers, each of which is running on a node of a distributed memory system.

Each RDBMS server can read, write, update, and delete information from the same shared database, which requires the system to implement a form of distributed lock manager (DLM).

DLM components can be found in hardware, in the operating system, or in a separate software layer, all depending on the system vendor.

On the positive side, shared-disk architectures can reduce performance bottlenecks resulting from data skew (uneven distribution of data) and can significantly increase system availability.

The shared-disk distributed memory design eliminates the memory access bottleneck typical of large SMP systems and helps reduce DBMS dependency on data partitioning.
Shared-Memory Architecture

Shared-memory or shared-everything style is the traditional approach to implementing an RDBMS on SMP hardware.

It is relatively simple to implement and has been very successful up to the point where it runs into the scalability limitations of the shared-everything architecture.

The key point of this technique is that a single RDBMS server can potentially apply all processors, access all memory, and access the entire database, thus providing the client with a consistent single system image.

In shared-memory SMP systems, the multiple database components executing SQL statements communicate with each other by exchanging messages and information via the shared memory.

All processors have access to all data, which is partitioned across local disks.

Shared-Nothing Architecture

In a shared-nothing distributed memory environment, the data is partitioned across all disks, and the DBMS is "partitioned" across multiple co-servers, each of which resides on an individual node of the parallel system and has ownership of its own disk and thus its own database partition.

A shared-nothing RDBMS parallelizes the execution of a SQL query across multiple processing nodes.

Each processor has its own memory and disk and communicates with other processors by exchanging messages and data over the interconnection network.

This architecture is optimized specifically for MPP and cluster systems.

Shared-nothing architectures offer near-linear scalability: the number of processor nodes is limited only by the hardware platform limitations (and budgetary constraints), and each node itself can be a powerful SMP system.
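The sketch below mimics the shared-nothing idea in miniature: rows are partitioned by key so that each "node" owns its own slice of the data, each node aggregates only its own partition, and the nodes exchange nothing but small partial results. The node count and data are invented for illustration.

from multiprocessing import Pool

NODES = 3

# Partition rows by key so that each node owns one partition.
node_partitions = {n: [] for n in range(NODES)}
for key, amount in [(101, 120), (102, 80), (103, 150), (104, 90)]:
    node_partitions[key % NODES].append((key, amount))

def node_aggregate(rows):
    # Runs against one node's own partition only; nodes communicate
    # solely by shipping back these small partial results.
    return sum(amount for _, amount in rows)

if __name__ == "__main__":
    with Pool(NODES) as pool:
        partials = pool.map(node_aggregate, list(node_partitions.values()))
    print(sum(partials))  # the coordinator merges the partial results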
Data Warehouse Tools

The tools that source data contents and formats accurately from operational and external data stores into the data warehouse have to perform several essential tasks, sketched in the example after this list:

o Data consolidation and integration.
o Data transformation from one format to another.
o Data transformation and calculation based on the application of business rules that drive the transformation.
o Metadata synchronization and management, which includes storing or updating metadata about source files, transformation actions, loading formats, and events.
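As a toy illustration of these tasks (every source name, field, and rule below is hypothetical), the sketch consolidates two extracts, applies a business-rule transformation, and records simple load metadata:

from datetime import datetime, timezone

# Two hypothetical source extracts with inconsistent formats.
crm_rows = [{"cust": "a001", "revenue": "1200.50"}]
erp_rows = [{"customer_id": "B002", "rev_inr": 900.0}]

def transform(row):
    # Business-rule transformation: standardize IDs and revenue units.
    cust = (row.get("cust") or row.get("customer_id")).upper()
    revenue = float(row.get("revenue") or row.get("rev_inr"))
    return {"customer_id": cust, "revenue_inr": round(revenue, 2)}

# Consolidation and integration: merge records from both data stores.
warehouse_rows = [transform(r) for r in crm_rows + erp_rows]

# Metadata management: record what was loaded, from where, and when.
load_metadata = {
    "sources": ["crm", "erp"],
    "row_count": len(warehouse_rows),
    "loaded_at": datetime.now(timezone.utc).isoformat(),
}
print(warehouse_rows, load_metadata)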

There are several selection criteria which should be considered while implementing a data warehouse:

1. The ability to identify the data in the data source environment that can be
read by the tool is necessary.
2. Support for flat files, indexed files, and legacy DBMSs is critical.
3. The capability to merge records from multiple data stores is required in
many installations.
4. The specification interface used to indicate the data to be extracted and the conversions to apply is essential.
5. The ability to read information from repository products or data
dictionaries is desired.
6. The code developed by the tool should be completely maintainable.
7. Selective data extraction of both data items and records enables users to
extract only the required data.
8. A field-level data examination for the transformation of data into
information is needed.
9. The ability to perform data type and the character-set translation is a
requirement when moving data between incompatible systems.
10. The ability to create aggregation, summarization, and derivation fields and records is necessary.
11. Vendor stability and support for the products are components that must
be evaluated carefully.

Data Warehouse Software Components

A warehousing team will require different types of tools during a warehouse project. These software products usually fall into one or more of the following categories.
Extraction and Transformation

The warehouse team needs tools that can extract, transform, integrate, clean, and load information from source systems into one or more data warehouse databases. Middleware and gateway products may be needed for warehouses that extract records from host-based source systems.

Warehouse Storage

Software products are also needed to store warehouse data and their
accompanying metadata. Relational database management systems are well
suited to large and growing warehouses.

Data Access and Retrieval

Different types of software are needed to access, retrieve, distribute, and present
warehouse data to its end-clients.
