
Department of Computer Science and Business Systems

UNIT I INTRODUCTION TO DATA WAREHOUSE


Data warehouse Introduction – Data warehouse components – Operational database vs. data warehouse –
Data warehouse Architecture – Three-tier Data Warehouse Architecture – Autonomous Data Warehouse –
Autonomous Data Warehouse vs. Snowflake – Modern Data Warehouse.

1.1 Data warehouse Introduction:


“A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data in
support of management’s decision-making process”.

Subject-oriented: A data warehouse is organized around major subjects such as customer, supplier,
product, and sales. Rather than concentrating on the day-to-day operations and transaction processing of
an organization, a data warehouse focuses on the modeling and analysis of data for decision-makers.
Hence, data warehouses typically provide a simple and concise view of particular subject
issues by excluding data that are not useful in the decision support process.

Integrated: A data warehouse is usually constructed by integrating multiple heterogeneous sources, such
as relational databases, flat files, and online transaction records. Data cleaning and data
integration techniques are applied to ensure consistency in naming conventions, encoding
structures, attribute measures, and so on.

Time-variant: Data is stored to provide information from a historic perspective (e.g., the past 5–
10 years). Every key structure in the data warehouse contains, either implicitly or explicitly, a time
element.

Non-volatile: A data warehouse is always a physically separate store of data transformed from
the application data found in the operational environment. Due to this separation, a data warehouse does
not require transaction processing, recovery, and concurrency control mechanisms. It usually requires
only two operations in data accessing: initial loading of data and access of data.


Features of a Data Warehouse


• It is separate from the Operational Database.
• Integrates data from heterogeneous systems.
• Stores HUGE amount of data, more historical than current data.
• Does not require data to be highly accurate.
• Queries are generally complex.
• The goal is to execute statistical queries and provide results that can influence decision-making in
favor of the Enterprise.
• These systems are thus called Online Analytical Processing Systems (OLAP).

The need for a Separate Data Warehouse:


An operational database is designed and tuned from known tasks and workloads like indexing and
hashing using primary keys, searching for particular records, and optimizing “canned” queries. On
the other hand, data warehouse queries are often complex. They involve the computation of large data
groups at summarized levels and may require the use of special data organization, access, and
implementation methods based on multidimensional views. Processing OLAP queries in operational
databases would substantially degrade the performance of operational tasks.
Moreover, an operational database supports the concurrent processing of multiple
transactions. Concurrency control and recovery mechanisms (e.g., locking and logging) are required
to ensure the consistency and robustness of transactions. An OLAP query often needs read-only access
to data records for summarization and aggregation. Concurrency control and recovery mechanisms, if
applied for such OLAP operations, may jeopardize the execution of concurrent transactions and thus
substantially reduce the throughput of an OLTP system.
Finally, the separation of operational databases from data warehouses is based on the
different structures, contents, and uses of the data in these two systems.

Data Warehouse Models


From the architecture point of view, there are three data warehouse models: the enterprise warehouse, the
data mart, and the virtual warehouse.

• Enterprise warehouse: An enterprise warehouse collects all of the information about


subjects spanning the entire organization. It provides corporate-wide data integration, usually from
one or more operational systems or external information providers, and is cross-functional in
scope. It typically contains detailed data as well as summarized data and can range in size
from a few gigabytes to hundreds of gigabytes, terabytes, or beyond.
An enterprise data warehouse may be implemented on traditional
mainframes, computer super servers, or parallel architecture platforms. It requires extensive
business modeling and may take years to design and build.
• Data mart: A data mart contains a subset of corporate-wide data that is of value to a
specific group of users. The scope is confined to specific selected subjects. For example, a
marketing data mart may confine its subjects to a customer, item, and sales. The data contained in
data marts tend to be summarized.
Depending on the source of data, data marts can be categorized into the following
two classes:
1. Independent data marts are sourced from data captured from one or more operational
systems or external information providers, or data generated locally within a particular department
or geographic area.
2. Dependent data marts are sourced directly from enterprise data warehouses.


• Virtual warehouse: A virtual warehouse is a set of views over operational databases. For
efficient query processing, only some of the possible summary views may be materialized.
A virtual warehouse is easy to build but requires excess capacity on operational database servers.

Data Warehouse Modeling: Data Cube


The entity-relationship data model is commonly used in the design of relational databases, where a
database schema consists of a set of entities and the relationships between them. Such a data model is
appropriate for on-line transaction processing. The data warehouse requires a concise, subject-oriented
schema that facilitates OLAP. Data warehouses and OLAP tools are based on a
multidimensional data model. This model views data in the form of a data cube. A data cube allows
data to be modeled and viewed in multiple dimensions. It is defined by dimensions and facts.


Given a set of dimensions, we can generate a cuboid for each of the possible subsets of the
given dimensions. The result would form a lattice of cuboids, each showing the data at a
different level of summarization or group-by.
The lattice of cuboids is then referred to as a data cube. Figure 4 shows a lattice of cuboids forming a
data cube for the dimensions time, item, location, and supplier.

The cuboid that holds the lowest level of summarization is called the base cuboid. For example,
the 4-D cuboid in Figure 3 is the base cuboid for the given time, item, location, and supplier dimensions.


Figure 2 is a 3-D (non-base) cuboid for time, item, and location, summarized for all suppliers. The 0-D
cuboid, which holds the highest level of summarization, is called the apex cuboid. In our example, this is
the total sales, or dollars sold, summarized over all four dimensions. The apex cuboid is typically denoted
by all.
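
Since every subset of the dimensions defines one cuboid, the lattice can be enumerated mechanically. Below is a minimal Python sketch; the dimension names come from the example above, everything else is illustrative:

# Enumerate the lattice of cuboids for a set of dimensions.
# Each subset of the dimensions defines one cuboid (one group-by);
# the full set is the base cuboid and the empty set is the apex cuboid.
from itertools import combinations

dimensions = ["time", "item", "location", "supplier"]

cuboids = [subset
           for k in range(len(dimensions) + 1)
           for subset in combinations(dimensions, k)]

print(len(cuboids))   # 2^4 = 16 cuboids in the lattice
print(cuboids[0])     # () -> the 0-D apex cuboid
print(cuboids[-1])    # ('time', 'item', 'location', 'supplier') -> base cuboid

For n dimensions (ignoring concept hierarchies), the lattice therefore contains 2^n cuboids.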

Conceptual Modeling of Data Warehouse:


The most popular data model for a data warehouse is a multidimensional model, which can exist
in the form of a star schema, a snowflake schema, or a fact constellation schema.

Star schema: The most common modeling paradigm is the star schema, in which the data
warehouse contains
(1) a large central table (fact table) containing the bulk of the data, with no redundancy, and
(2) a set of smaller attendant tables (dimension tables), one for each dimension.
The schema graph resembles a starburst, with the dimension tables displayed in a radial pattern
around the central fact table.
A star schema for AllElectronics sales is shown in Figure 5. Sales are considered along
four dimensions: time, item, branch, and location. The schema contains a central fact table for
sales that contains keys to each of the four dimensions, along with two measures: dollars sold and units
sold.
To minimize the size of the fact table, dimension identifiers (e.g., time key and item key)
are system-generated identifiers. Notice that in the star schema, each dimension is represented by only
one table, and each table contains a set of attributes.
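
As a concrete illustration of this layout, here is a minimal sketch of the star schema in SQL, run through Python's built-in sqlite3 module. The table names follow the AllElectronics example; the column lists are abbreviated assumptions, not the full textbook definition:

# A star schema: one central fact table holding keys to four dimension
# tables plus two measures. Column lists are abbreviated for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time     (time_key INTEGER PRIMARY KEY, day, month, quarter, year);
CREATE TABLE item     (item_key INTEGER PRIMARY KEY, item_name, brand, type);
CREATE TABLE branch   (branch_key INTEGER PRIMARY KEY, branch_name, branch_type);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, street, city, state, country);

CREATE TABLE sales (
    time_key     INTEGER REFERENCES time(time_key),
    item_key     INTEGER REFERENCES item(item_key),
    branch_key   INTEGER REFERENCES branch(branch_key),
    location_key INTEGER REFERENCES location(location_key),
    dollars_sold REAL,
    units_sold   INTEGER
);
""")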


Snowflake schema: The snowflake schema is a variant of the star schema model, where some dimension
tables are normalized, thereby further splitting the data into additional tables. The resulting schema graph
forms a shape similar to a snowflake.
The major difference between the snowflake and star schema models is that the dimension tables
of the snowflake model may be kept in the normalized form to reduce redundancies. Such a table is easy
to maintain and saves storage space.
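
A brief sketch of what this normalization looks like in practice, continuing the sqlite3 example above: the location dimension is snowflaked so that city-level attributes move into their own table (the column choices are illustrative):

# Snowflaked location dimension: city attributes are split into a
# separate table and referenced by key, reducing redundancy.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE city (
    city_key INTEGER PRIMARY KEY,
    city, state, country
);

CREATE TABLE location (
    location_key INTEGER PRIMARY KEY,
    street,
    city_key INTEGER REFERENCES city(city_key)
);
""")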


Fact constellation: Sophisticated applications may require multiple fact tables to share dimension tables.
This kind of schema can be viewed as a collection of stars, and hence is called a galaxy schema or a fact
constellation

A fact constellation schema is shown in Figure 7. This schema specifies two fact tables, sales and
shipping. The sales table definition is identical to that of the star schema (Figure 5). The shipping table
has five dimensions, or keys—item key, time key, shipper key, from location, and to location— and two
measures—dollars cost and units shipped.
A fact constellation schema allows dimension tables to be shared between fact tables. For
example, the dimension tables for time, item, and location are shared between the sales and shipping fact
tables.

Concept Hierarchies
A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-
level, more general concepts. Consider a concept hierarchy for the dimension location.
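
For example, a location hierarchy might run street < city < state < country. A small Python sketch of such a mapping (the place names are invented for illustration):

# Concept hierarchy for location: each level maps to a more general one.
city_of    = {"1 MG Road": "Chennai", "5 FC Road": "Pune"}
state_of   = {"Chennai": "Tamil Nadu", "Pune": "Maharashtra"}
country_of = {"Tamil Nadu": "India", "Maharashtra": "India"}

def generalize(street):
    """Climb the hierarchy from a street up to its country."""
    city = city_of[street]
    state = state_of[city]
    return [street, city, state, country_of[state]]

print(generalize("1 MG Road"))  # ['1 MG Road', 'Chennai', 'Tamil Nadu', 'India']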


Top-Down Approach
In this approach the data in the data warehouse is stored at the lowest level of granularity
based on a normalized data model. The centralized data warehouse would feed the dependent data marts
that may be designed based on a dimensional data model.
The advantages of this approach are:
• A truly corporate effort, an enterprise view of data
• Inherently architected, not a union of disparate data marts
• Single, central storage of data about the content
• Centralized rules and control
• May see quick results if implemented with iterations

The disadvantages are:
• Takes longer to build even with an iterative method
• High exposure to risk of failure
• Needs high level of cross-functional skills
• High outlay without proof of concept

Bottom-Up Approach
In this approach data marts are created first to provide analytical and reporting capabilities
for specific business subjects based on the dimensional data model.

The advantages of this approach are:


• Faster and easier implementation of manageable pieces
• Favorable return on investment and proof of concept
• Less risk of failure
• Inherently incremental; can schedule important data marts first


• Allows project team to learn and grow


The disadvantages are:


• Each data mart has its own narrow view of data
• Permeates redundant data in every data mart
• Perpetuates inconsistent and irreconcilable data
• Proliferates unmanageable interfaces

Measures: Their Categorization and Computation


A data cube measure is a numeric function that can be evaluated at each point in the data cube
space. A measure value is computed for a given point by aggregating the data corresponding to
the respective dimension-value pairs defining the given point.

• Distributive: An aggregate function is distributive if it can be computed in a distributed manner


as follows. Suppose the data are partitioned into n sets. We apply the function to each partition,
resulting in n aggregate values. If the result derived by applying the function to the n aggregate
values is the same as that derived by applying the function to the entire data set (without
partitioning), the function can be computed in a distributed manner. For example, sum() can be
computed for a data cube by first partitioning the cube into a set of sub-cubes, computing sum()
for each sub-cube, and then summing up the sums obtained for each sub-cube. Hence, sum() is a
distributive aggregate function. For the same reason, count(), min(), and max() are
distributive aggregate functions.

• Algebraic: An aggregate function is algebraic if it can be computed by an algebraic function with


M arguments (where M is a bounded positive integer), each of which is obtained by applying a
distributive aggregate function. For example, avg() (average) can be computed by sum()/count(),
where both sum() and count() are distributive aggregate functions. Similarly, it can be shown that
min N() and max N() (which find the N minimum and N maximum values, respectively,
in a given set) and standard deviation() are algebraic aggregate functions. A measure is algebraic
if it is obtained by applying an algebraic aggregate function.

• Holistic: An aggregate function is holistic if there is no constant bound on the storage size needed
to describe a sub-aggregate. That is, there does not exist an algebraic function with M arguments
(where M is a constant) that characterizes the computation. Common examples of holistic
functions include median(), mode(), and rank(). A measure is holistic if it is obtained by applying
a holistic aggregate function.
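
The three categories can be demonstrated on partitioned data. The following Python sketch (with made-up numbers) shows sum() computed distributively, avg() derived algebraically from sum and count, and why median() resists the same treatment:

# sum() is distributive: summing per-partition sums equals the global sum.
# avg() is algebraic: derivable from two distributive values, sum and count.
# median() is holistic: no bounded sub-aggregate suffices.
from statistics import median

partitions = [[3, 5, 8], [1, 9], [4, 4, 7]]
whole = [x for p in partitions for x in p]

# Distributive: apply sum() to each partition, then sum the partial results.
assert sum(sum(p) for p in partitions) == sum(whole)

# Algebraic: avg = sum / count, both computed distributively.
total = sum(sum(p) for p in partitions)
count = sum(len(p) for p in partitions)
assert total / count == sum(whole) / len(whole)

# Holistic: the median of per-partition medians is NOT the true median in general.
print(median(median(p) for p in partitions), "vs true median", median(whole))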


Data Warehouse Design and Usage


To design an effective data warehouse we need to understand and analyze business needs
and construct a business analysis framework. The construction of a large and complex information system
can be viewed as the construction of a large and complex building, for which the owner, architect, and
builder have different views.
Four different views regarding a data warehouse design must be considered: the top-down view,
the data source view, the data warehouse view, and the business query view.

• The top-down view allows the selection of the relevant information necessary for the data
warehouse. This information matches current and future business needs.
• The data source view exposes the information being captured, stored, and managed
by operational systems. This information may be documented at various levels of detail
and accuracy, from individual data source tables to integrated data source tables. Data
sources are often modeled by traditional data modeling techniques, such as the entity-
relationship model or CASE (computer-aided software engineering) tools.
• The data warehouse view includes fact tables and dimension tables. It represents the information
that is stored inside the data warehouse, including pre-calculated totals and counts, as well
as information regarding the source, date, and time of origin, added to provide historical context.
• Finally, the business query view is the data perspective in the data warehouse from the end-user's viewpoint.

Building and using a data warehouse is a complex task because it requires business skills, technology
skills, and program management skills.
Regarding business skills, building a data warehouse involves understanding how systems store and
manage their data, how to build extractors that transfer data from the operational system to the
data warehouse, and how to build warehouse refresh software that keeps the data warehouse reasonably
up-to-date with the operational systems' data.
Regarding technology skills, data analysts are required to understand how to make assessments
from quantitative information and derive facts based on conclusions from historic information in
the data warehouse. These skills include the ability to discover patterns and trends, extrapolate
trends based on history and look for anomalies or paradigm shifts, and to present coherent managerial
recommendations based on such analysis.
Finally, program management skills involve the need to interface with many technologies, vendors,
and end-users to deliver results in a timely and cost-effective manner.


Data Warehouse Design Process


A data warehouse can be built using a top-down approach, a bottom-up approach, or a
combination of both.
• The top-down approach starts with overall design and planning. It is useful in cases where the
technology is mature and well known, and where the business problems that must be solved are
clear and well understood.
• The bottom-up approach starts with experiments and prototypes. This is useful in the early stage
of business modeling and technology development. It allows an organization to move forward at
considerably less expense and to evaluate the technological benefits before making
significant commitments.
• In the combined approach, an organization can exploit the planned and strategic nature of the top-
down approach while retaining the rapid implementation and opportunistic application of
the bottom-up approach.

In general, the warehouse design process consists of the following steps:


1. Choose a business process to model (e.g., orders, invoices, shipments, inventory,
account administration, sales, or the general ledger). If the business process is organizational and
involves multiple complex object collections, a data warehouse model should be followed. However, if
the process is departmental and focuses on the analysis of one kind of business process, a data mart model
should be chosen.
2. Choose the business process grain, which is the fundamental, atomic level of data to be represented in
the fact table for this process (e.g., individual transactions, individual daily snapshots, and so on).
3. Choose the dimensions that will apply to each fact table record. Typical dimensions are time, item,
customer, supplier, warehouse, transaction type, and status.
4. Choose the measures that will populate each fact table record. Typical measures are numeric additive
quantities like dollars sold and units sold.
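
To make the four steps concrete, here is a hypothetical outcome of the process for retail sales, written as a small Python structure; the specific choices are illustrative, not prescribed by the text:

# Hypothetical result of the four design steps for a retail sales process.
warehouse_design = {
    "business_process": "sales",                          # step 1: process to model
    "grain": "one row per line item on a sales receipt",  # step 2: atomic grain
    "dimensions": ["time", "item", "customer", "store"],  # step 3: dimensions
    "measures": ["dollars_sold", "units_sold"],           # step 4: additive measures
}

for step, choice in warehouse_design.items():
    print(step, "->", choice)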

The goals of an initial data warehouse implementation should be specific, achievable, and measurable.

Once a data warehouse is designed and constructed, the initial deployment of the
warehouse includes initial installation, roll-out planning, training, and orientation. Data warehouse
administration includes data refreshment, data source synchronization, planning for disaster recovery,
managing access control and security, managing data growth, managing database performance, and
data warehouse enhancement and extension.
Various kinds of data warehouse design tools are available. Data warehouse development tools


provide functions to define and edit metadata repository contents (e.g., schemas, scripts, or rules), answer


queries, output reports, and ship metadata to and from relational database system catalogs. Planning and
analysis tools study the impact of schema changes and refresh performance when changing refresh rates
or time windows.
Data Warehouse Usage for Information Processing
Data warehouses and data marts are used in a wide range of applications. There are three kinds of
data warehouse applications: information processing, analytical processing, and data mining.
Information processing supports querying, basic statistical analysis, and reporting using crosstabs,
tables, charts, or graphs. A current trend in data warehouse information processing is to construct
low-cost web-based accessing tools that are then integrated with web browsers.
Analytical processing supports basic OLAP operations, including slice-and-dice, drill-down, roll-up, and
pivoting. It generally operates on historic data in both summarized and detailed forms. The major strength
of online analytical processing over information processing is the multidimensional data analysis of data
warehouse data.
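
As a small illustration of roll-up and slice, here is a sketch using the pandas library and a made-up sales table (neither is prescribed by the text):

# Roll-up: aggregate detail rows up to a coarser level (quarters -> years).
# Slice: fix one dimension's value and keep the rest of the cube.
import pandas as pd

sales = pd.DataFrame({
    "year":         [2023, 2023, 2024, 2024],
    "quarter":      ["Q1", "Q2", "Q1", "Q2"],
    "location":     ["Chennai", "Chennai", "Mumbai", "Mumbai"],
    "dollars_sold": [1000, 1500, 1200, 1800],
})

by_year = sales.groupby("year")["dollars_sold"].sum()   # roll-up to year level
chennai = sales[sales["location"] == "Chennai"]         # slice on location

print(by_year)
print(chennai)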
Data mining supports knowledge discovery by finding hidden patterns and associations,
constructing analytical models, performing classification and prediction, and presenting the mining
results using visualization tools. “How does data mining relate to information processing and
online analytical processing?” Information processing, based on queries, can find useful information.
However, answers to such queries reflect the information directly stored in databases or computable by
aggregate functions. They do not reflect sophisticated patterns or regularities buried in the database.
Therefore, information processing is not data mining.

Depending on the type and capacities of a warehouse, it can become home to structured, semi-structured,
or unstructured data.
• Structured data is highly organized and commonly exists in a tabular format like Excel files.
• Unstructured data comes in all forms and shapes, from audio files to PDF documents, and doesn't have a pre-defined structure.
• Semi-structured data is somewhere in the middle, meaning it is partially structured but doesn't fit the tabular models of relational databases. Examples are JSON, XML, and Avro files.
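
A short Python sketch of the contrast: the same facts as a semi-structured JSON document versus a flattened, structured row (the record itself is invented for illustration):

# Semi-structured JSON carries its own field names and nesting;
# a structured row relies on a fixed, tabular schema instead.
import json

semi_structured = '{"customer": "C42", "orders": [{"item": "laptop", "qty": 1}]}'
record = json.loads(semi_structured)

# Flatten the nested order into a tabular (structured) row.
row = (record["customer"], record["orders"][0]["item"], record["orders"][0]["qty"])
print(row)  # ('C42', 'laptop', 1)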


Data Warehouse Tools


The tools that source data contents and formats accurately from operational and external data stores into
the data warehouse have to perform several essential tasks, which include:
o Data consolidation and integration.
o Data transformation from one form to another.
o Data transformation and calculation based on the business rules that drive the transformation.
o Metadata synchronization and management, which includes storing or updating metadata
about source files, transformation actions, loading formats, and events.
There are several selection criteria which should be considered while implementing a data warehouse:

1. The ability to identify the data in the data source environment that can be read by the
tool is necessary.
2. Support for flat files, indexed files, and legacy DBMSs is critical.
3. The capability to merge records from multiple data stores is required in many installations.
4. The specification interface to indicate the information to be extracted and converted is essential.
5. The ability to read information from repository products or data dictionaries is desired.
6. The code developed by the tool should be completely maintainable.
7. Selective data extraction of both data items and records enables users to extract only the required
data.
8. A field-level data examination for the transformation of data into information is needed.
9. The ability to perform data-type and character-set translation is a requirement when moving
data between incompatible systems.
10. The ability to create aggregation, summarization, and derivation fields and records is necessary.
11. Vendor stability and support for the products are components that must be evaluated carefully.

History of Data Warehouse


The idea of data warehousing dates to the late 1980s, when IBM researchers Barry Devlin and Paul
Murphy established the "Business Data Warehouse."
In essence, the data warehousing idea was intended to support an architectural model for the flow
of information from operational systems to decision support environments. The concept
attempted to address the various problems associated with this flow, mainly the high costs associated with it.
In the absence of a data warehousing architecture, a vast amount of space was required to support multiple
decision support environments. In large corporations, it was common for various decision support
environments to operate independently.

Goals of Data Warehousing


o To help reporting as well as analysis
o Maintain the organization's historical information
o Be the foundation for decision making.

Need for Data Warehouse


Data Warehouse is needed for the following reasons:


1. Business User: Business users require a data warehouse to view summarized data from
the past. Since these people are non-technical, the data may be presented to them in an elementary
form.
2. Store historical data: A data warehouse is required to store time-variant data from the past.
This data is then used for various purposes.
3. Make strategic decisions: Some strategies may depend upon the data in the data
warehouse. So, a data warehouse contributes to making strategic decisions.
4. For data consistency and quality: By bringing data from different sources to a common place,
the user can effectively ensure uniformity and consistency in the data.
5. High response time: A data warehouse has to be ready for somewhat unexpected loads
and types of queries, which demands a significant degree of flexibility and quick response time.

Benefits of Data Warehouse


1. Understand business trends and make better forecasting decisions.
2. Data warehouses are designed to perform well with enormous amounts of data.
3. The structure of data warehouses is more accessible for end-users to navigate, understand,
and query.
4. Queries that would be complex in many normalized databases could be easier to build
and maintain in data warehouses.
5. Data warehousing is an efficient method to manage demand for lots of information from lots of
users.
6. Data warehousing provides the capability to analyze large amounts of historical data.

Advantages of a Data Warehouse:


o Data warehouses facilitate end users' access to a variety of data.
o Assist decision-support applications such as trend reports (for instance, obtaining the products
that have sold the most in a specific area over the past two years) and exception reports that
compare actual outcomes to predetermined goals.
o Using numerous data warehouses can increase the operational value of business systems,
especially customer relationship management.
o Enables decisions of higher quality.
o For the medium and long term, it is especially helpful.
o Installing these systems is quite straightforward if the data sources and goals are clear.
o Storage of analyses and historical search queries is quite beneficial.
o It has a strong capacity for digesting information.
o Access to information is made more flexible and quick because of it.
o Allows for easier corporate decision-making.
o The productivity of businesses rises.
o Gives the company's many departments reliable communication.
o Strengthen connections with customers and suppliers.
o It makes it possible to keep up with business activity and be constantly informed of successful and
unsuccessful outcomes.
o Transforms data into information and information into knowledge.
o Enables more successful planning.
o Reduces operating expenses and response times.
o The Data Warehouse assists in fusing many data sources, lessening the production system's
workload.
o The data warehouse aids in reducing the overall turnaround time for reporting and research.
o Restructuring and consolidation make documentation and review easier for the consumer.
o Users have single-point access to multiple sources of private data thanks to the data warehouse.
Additionally, it saves users time when they access data from numerous sources.
o A data warehouse keeps a significant volume of historical data. Users can compare different
periods and trends to make predictions.

Disadvantages of a Data Warehouse:


o A data warehouse may incur substantial expenditure over its life cycle. A data warehouse
is typically not static, and maintenance costs are considerable.


o Data warehouses could soon become outdated.


o They occasionally need complete information to be prepared before it is requested, which
also costs the organization money.
o Between data warehouses and operational systems, there is frequently a fine line. It is necessary to
determine which of these features can be used and which ones should be implemented in the data
warehouse since it would be expensive to carry out needless activities or to stop carrying
out those that would be required.
o It may be less useful for making decisions in real time due to the prolonged processing time it
can require. In any event, the trend in modern products (along with technological advancements)
addresses this issue by turning the drawback into a benefit.
o Regarding the various objectives a company seeks to achieve, challenges may arise during
implementation.
o It might be challenging to include new data sources once a system has been implemented.
o They necessitate an examination of the data model, objects, transactions, and storage.
o They were designed in a sophisticated, multidisciplinary manner.
o The operating systems must be reorganized to accommodate them.
o Data warehouses are high-maintenance systems. Any restructuring of source systems and business
processes could influence the data warehouse, resulting in significant maintenance costs.
o The data warehouse may seem simple, but it is too complex for the typical person to comprehend.
o The scope of the data storage project will start to expand, despite the best efforts of
project management.
o At this point, various business regulations may already be in place for warehouse clients.
o Homogenization of data: data warehousing also imposes similar formats on data from many
sources, and the loss of some important data components could be the outcome.


1.2 Data warehouse components

• Source Data Component:


Source data coming into the data warehouse may be grouped into four broad categories

1. Production Data This category of data comes from the various operational systems of the enterprise.
These normally include financial systems, manufacturing systems, systems along the supply chain, and
customer relationship management systems. Based on the information requirements in the
data warehouse, you choose segments of data from the different operational systems.

2. Internal Data In every organization, users keep their “private” spreadsheets, documents,
customer profiles, and sometimes even departmental databases. This is the internal data, parts of which
could be useful in a data warehouse.

3. Archived Data Operational systems are primarily intended to run the current business. In
every operational system, you periodically take the old data and store it in archived files. The
circumstances in your organization dictate how often and which portions of the operational
databases are archived for storage. Some data is archived after a year. Sometimes data is left in the
operational system databases for as long as five years. Much of the archived data comes from old legacy
systems that are nearing the end of their useful lives in organizations.


4. External Data Most executives depend on data from external sources for a high percentage of
the information they use. They use statistics relating to their industry produced by external
agencies and national statistical offices. They use market share data of competitors. They use
standard values of financial indicators for their business to check on their performance.
For example, the data warehouse of a car rental company contains data on the current production
schedules of the leading automobile manufacturers. This external data in the data warehouse helps the car
rental company plan for its fleet management.
• Data Staging Component
After you have extracted data from various operational systems and from external sources, you have
to prepare the data for storing in the data warehouse. The extracted data coming from several disparate
sources needs to be changed, converted, and made ready in a format that is suitable to be stored
for querying and analysis. Three major functions need to be performed for getting the data ready. You
have to extract the data, transform the data, and then load the data into the data warehouse storage.

1. Data Extraction: This function has to deal with numerous data sources. You have to employ
the appropriate technique for each data source. Source data may be from different source machines in
diverse data formats. Part of the source data may be in relational database systems. Some data may be on
other legacy network and hierarchical data models.
Many data sources may still be in flat files. You may want to include data from spreadsheets and
local departmental data sets. Data extraction may become quite complex. Tools are available on
the market for data extraction. You may want to consider using outside tools suitable for certain data
sources.

2. Data Transformation: Data transformation involves many forms of combining pieces of data from
the different sources. You combine data from a single source record or related data elements from many
source records. On the other hand, data transformation also involves purging source data that is not useful
and separating out source records into new combinations. Sorting and merging of data takes place on a
large scale in the data staging area.


3. Data Loading: Two distinct groups of tasks form the data loading function. When you complete the
design and construction of the data warehouse and go live for the first time, you do the initial loading of
the data into the data warehouse storage. The initial load moves large volumes of data using up
substantial amounts of time. As the data warehouse starts functioning, you continue to extract the changes
to the source data, transform the data revisions, and feed the incremental data revisions on an ongoing
basis.
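
Putting the three staging functions together, here is a minimal, self-contained ETL sketch in Python; the record layout, field names, and cleansing rules are illustrative assumptions, not a prescribed design:

# Minimal ETL sketch matching the three staging functions described above.
import csv
import io
import sqlite3

RAW = """customer_id,country,amount
C1, in ,100.0
,us,55.0
C2,US,250.0
"""

def extract(source):
    # Extraction: read raw records from a flat-file source.
    return list(csv.DictReader(source))

def transform(rows):
    # Transformation: standardize encodings and purge unusable records.
    cleaned = []
    for row in rows:
        if not row["customer_id"]:                # purge records with no key
            continue
        country = row["country"].strip().upper()  # standardize the encoding
        cleaned.append((row["customer_id"], country, float(row["amount"])))
    return cleaned

def load(rows, conn):
    # Loading: write the prepared records into warehouse storage.
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer_id, country, amount)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(io.StringIO(RAW))), conn)
print(conn.execute("SELECT * FROM sales").fetchall())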
• Data Storage Component
The data storage for the data warehouse is a separate repository. The operational systems of
your enterprise support the day-to-day operations. These are online transaction processing
applications. The data repositories for the operational systems typically contain only the current
data. Also, these data repositories contain the data structured in highly normalized formats for fast and
efficient processing.
Generally, the database in your data warehouse must be open. Depending on your requirements, you
are likely to use tools from multiple vendors. The data warehouse must be open to different tools. Most of
the data warehouses employ relational database management systems.
Many data warehouses also employ multidimensional database management systems. Data extracted
from the data warehouse storage is aggregated in many ways and the summary data is kept in
the multidimensional databases (MDDBs). Such multidimensional database systems are usually
proprietary products.

• Information Delivery Component


In order to provide information to the wide community of data warehouse users, the information
delivery component includes different methods of information delivery. Figure 2-9 shows the different
information delivery methods. Ad hoc reports are predefined reports primarily meant for novice
and casual users. Provision for complex queries, multidimensional (MD) analysis, and statistical
analysis cater to the needs of the business analysts and power users. Information fed into executive
information systems (EIS) is meant for senior executives and high-level managers. Some data
warehouses also provide data to data-mining applications
• Metadata Component
Metadata in a data warehouse is similar to the data dictionary or the data catalog in a
database management system. In the data dictionary, you keep the information about the logical data
structures, the information about the files and addresses, the information about the indexes, and so
on. The data dictionary contains data about the data in the database. Similarly, the metadata
component is the data about the data in the data warehouse.

Management and Control Component


This component of the data warehouse architecture sits on top of all the other components. The
management and control component coordinates the services and activities within the data
warehouse. This component controls the data transformation and the data transfer into the data
warehouse storage. On the other hand, it moderates the information delivery to the users. It works
with the database management systems and enables data to be properly stored in the repositories. It
monitors the movement of data into the staging area and from there into the data warehouse storage itself.
The management and control component interacts with the metadata component to perform the
management and control functions. As the metadata component contains information about the data
warehouse itself, the metadata is the source of information for the management module.

Types of Metadata
Metadata in a data warehouse fall into three major categories:
• Operational metadata
• Extraction and transformation metadata
• End-user metadata

1. Operational Metadata: As you know, data for the data warehouse comes from several
operational systems of the enterprise. These source systems contain different data structures. The
data elements selected for the data warehouse have various field lengths and data types. In
selecting data from the source systems for the data warehouse, you split records, combine parts of
records from different source files, and deal with multiple coding schemes and field lengths. When you
deliver information to the end-users, you must be able to tie that back to the original source data sets.
Operational metadata contain all of this information about the operational data sources.

2. Extraction and Transformation Metadata: Extraction and transformation metadata contain data about the extraction
of data from the source systems, namely, the extraction frequencies, extraction methods, and business
rules for the data extraction. Also, this category of metadata contains information about all the data
transformations that take place in the data staging area.

3. End-User Metadata: The end-user metadata is the navigational map of the data warehouse. It enables
the end-users to find information from the data warehouse. The end-user metadata allows the end-users to
use their own business terminology and look for information in those ways in which they normally think
of the business
1.3 Operational Database vs. Data Warehouse

An operational database supports the day-to-day running of the business: it holds current, highly
normalized data and is tuned for online transaction processing (OLTP). A data warehouse, by contrast,
stores integrated, historical, subject-oriented data and is tuned for complex analytical (OLAP) queries,
as described in Section 1.1.

1.4 Data warehouse Architecture

1. Centralized Data Warehouse


This architectural type takes into account the enterprise-level information requirements. An
overall infrastructure is established. Atomic level normalized data at the lowest level of granularity
is stored in the third normal form. Occasionally, some summarized data is included. Queries
and applications access the normalized data in the central data warehouse. There are no separate data
marts.

2. Independent Data Marts


This architectural type evolves in companies where the organizational units develop their
own data marts for their own specific purposes. Although each data mart serves the particular
organizational unit, these separate data marts do not provide “a single version of the truth.” The
data marts are independent of one another. As a result, these different data marts are likely to have
inconsistent data definitions and standards. Such variances hinder analysis of data across data marts. For
example, if there are two independent data marts, one for sales and the other for shipments, although sales
and shipments are related subjects, the independent data marts would make it difficult to analyze sales
and shipments data together.

3. Federated
Some companies get into data warehousing with an existing legacy of an assortment of decision-
support structures in the form of operational systems, extracted datasets, primitive data marts, and so on.
For such companies, it may not be prudent to discard all that huge investment and start from scratch. The
practical solution is a federated architectural type where data may be physically or logically integrated
through shared key fields, overall global metadata, distributed queries, and such other methods. In this
architectural type, there is no one overall data warehouse.

4. Hub-and-Spoke
This is the Inmon Corporate Information Factory approach. Similar to the centralized data
warehouse architecture, here too is an overall enterprise-wide data warehouse. Atomic data in the third
normal form is stored in the centralized data warehouse. The major and useful difference is the presence
of dependent data marts in this architectural type. Dependent data marts obtain data from the centralized
data warehouse. The centralized data warehouse forms the hub to feed data to the data marts on
the spokes. The dependent data marts may be developed for a variety of purposes: departmental
analytical needs, specialized queries, data mining, and so on. Each dependent data mart may have
normalized, denormalized, summarized, or dimensional data structures based on individual
requirements. Most queries are directed to the dependent data marts although the centralized data
warehouse may itself be used for querying. This architectural type results from adopting a top-down
approach to data warehouse development.

5. Data-Mart Bus
This is the Kimball conformed supermarts approach. You begin by analyzing requirements for a
specific business subject such as orders, shipments, billings, insurance claims, car rentals, and so on. You
build the first data mart (supermart) using business dimensions and metrics. These business dimensions
will be shared in the future data marts. The principal notion is that by conforming dimensions among the
various data marts, the result would be logically integrated supermarts that will provide an
enterprise view of the data. The data marts contain atomic data organized as a dimensional data
model. This architectural type results from adopting an enhanced bottom-up approach to data warehouse
development.

1.5 Three-tier Data Warehouse Architecture


Two-Tier Architecture
The requirement for separation plays an essential role in defining the two-tier architecture for a
data warehouse system, as shown in figure.


Although it is typically called two-layer architecture to highlight a separation between physically
available sources and the data warehouse, it in fact consists of four subsequent data flow stages:


Source layer: A data warehouse system uses heterogeneous sources of data. That data is initially stored
in corporate relational databases or legacy databases, or it may come from information systems outside
the corporate walls.

Data Staging: The data stored in the sources should be extracted, cleansed to remove inconsistencies and
fill gaps, and integrated to merge heterogeneous sources into one standard schema. The so-named
Extraction, Transformation, and Loading (ETL) tools can combine heterogeneous schemata, extract,
transform, cleanse, validate, filter, and load source data into a data warehouse.

Data Warehouse layer: Information is saved to one logically centralized individual repository: a
data warehouse. The data warehouses can be directly accessed, but they can also be used as a
source for creating data marts, which partially replicate data warehouse contents and are designed
for specific enterprise departments. Meta-data repositories store information on sources, access
procedures, data staging, users, data mart schema, and so on.

Analysis: In this layer, integrated data is efficiently and flexibly accessed to issue reports, dynamically
analyze information, and simulate hypothetical business scenarios. It should feature aggregate information
navigators, complex query optimizers, and customer-friendly GUIs.


Three-Tier Architecture
The three-tier architecture consists of the source layer (containing multiple source systems), the
reconciled layer, and the data warehouse layer (containing both data warehouses and data marts).
The reconciled layer sits between the source data and data warehouse.
The main advantage of the reconciled layer is that it creates a standard reference data model for a
whole enterprise. At the same time, it separates the problems of source data extraction and
integration from those of the data warehouse population. In some cases, the reconciled layer is also
directly used to better accomplish some operational tasks, such as producing daily reports that cannot
be satisfactorily prepared using the corporate applications or generating data flows to feed external
processes periodically to benefit from cleaning and integration.
This architecture is especially useful for extensive, enterprise-wide systems. A disadvantage
of this structure is the extra file storage space used through the extra redundant reconciled layer.
It also makes the analytical tools a little further away from being real-time.


4-Tier Architecture
User: At the end-user layer, data in the ODS, data warehouse, and data marts can be accessed by
using a variety of tools such as query and reporting tools, data visualization tools, and analytical
applications.
Presentation layer: Its functions include receiving input data, interpreting users' instructions,
sending requests to the data services layer, and displaying the data obtained from that layer in a way
users can understand. It is closest to users and provides an interactive operation interface.
Business logic: It is located between the presentation layer and the data access layer, playing a
connecting role in the data exchange. The layer's concerns are focused primarily on the development of
business rules, business processes, and business needs related to the system. It is also known as the
domain layer.
Data Access: Located in the innermost layer, it implements persistence logic and is responsible for
access to the database. Operations on the data include finding, adding, deleting, and modifying records.
This level works independently, without relying on other layers. The data access layer (DAL) extracts
the appropriate data from the database and passes it to the upper layers.

Principles of Data Warehousing:


1. Load Performance
Data warehouses require incremental loading of new data on a periodic basis within narrow time windows;
performance of the load process should be measured in hundreds of millions of rows and gigabytes per
hour and must not artificially constrain the volume of data the business requires.


2. Load Processing
Many steps must be taken to load new or updated data into the data warehouse, including
data conversion, filtering, reformatting, indexing, and metadata update.
3. Data Quality Management
Fact-based management demands the highest data quality. The warehouse ensures local consistency,
global consistency, and referential integrity despite "dirty" sources and massive database size.
4. Query Performance
Fact-based management must not be slowed by the performance of the data warehouse RDBMS;
large, complex queries must be completed in seconds, not days.
5. Terabyte Scalability
Data warehouse sizes are growing at astonishing rates, today ranging from a few gigabytes to hundreds
of gigabytes and terabyte-sized data warehouses.

Properties of Data Warehouse Architectures


1. Separation: Analytical and transactional processing should be kept apart as much as possible.
2. Scalability: Hardware and software architectures should be simple to upgrade as the data volume,
which has to be managed and processed, and the number of users' requirements, which have to be met,
progressively increase.
3. Extensibility: The architecture should be able to accommodate new operations and technologies
without redesigning the whole system.
4. Security: Monitoring access is necessary because of the strategic data stored in the data
warehouse.
5. Administerability: Data warehouse management should not be complicated.

1.6 Autonomous Data Warehouse


Autonomous Data Warehouse (ADW) is a fully managed, cloud-based data warehousing solution
provided by Oracle Cloud. It leverages artificial intelligence and machine learning to automate various
administrative tasks, making it self-driving, self-securing, and self-repairing. ADW is designed to handle
large volumes of data and deliver high-performance analytics for business intelligence and data-
driven decision-making.
Top 10 use cases of Oracle Autonomous Data Warehouse:
1. Data Warehousing: Storing and querying large datasets for analytical and reporting purposes.
2. Business Intelligence (BI): Building and delivering interactive dashboards and reports for data-
driven decision-making.
3. Data Analytics: Running complex analytical queries on large volumes of data.
4. Data Integration: Integrating and consolidating data from different sources for analysis.

5. Real-time Analytics: Combining real-time data streams with ADW for near real-time analytics.
6. Customer Analytics: Analyzing customer data to understand behavior and preferences.
7. Predictive Analytics: Building and training predictive models for forecasting and data-driven
insights.
8. Financial Analytics: Analyzing financial data for budgeting, forecasting, and performance
analysis.
9. IoT Data Analysis: Analyzing data from Internet of Things (IoT) devices to derive insights.
10. Compliance and Reporting: Storing historical data for compliance and reporting purposes.
What are the features of Oracle Autonomous Data Warehouse?
• Self-Driving: Automated database tuning and optimization for better performance and reduced manual tasks.
• Self-Securing: Automated security measures to protect data and prevent unauthorized access.
• Self-Repairing: Automatic error detection and resolution to ensure high availability.
• Scalability: ADW can scale compute and storage resources independently to match workload demands.
• In-Memory Processing: Utilizes in-memory columnar processing for faster query performance.
• Parallel Execution: Queries are processed in parallel across multiple nodes for faster results.
• Integration with Oracle Ecosystem: Seamless integration with other Oracle Cloud services and tools.
• Data Encryption: Provides data encryption both at rest and in transit for data security.
• Easy Data Loading: Supports data loading from various sources, including Oracle Data Pump, SQL Developer, and SQL*Loader.
• Pay-as-You-Go Pricing: Based on consumption, offering cost-effective pricing.

How does Oracle Autonomous Data Warehouse work, and what is its architecture?


Oracle Autonomous Data Warehouse is built on Oracle Exadata, which is a highly optimized platform
for data warehousing and analytics.
1. Storage Layer: Data is stored in Exadata storage servers using a combination of flash and disk
storage.
2. Compute Layer: The compute nodes are responsible for processing queries and analyzing data.
ADW uses a massively parallel processing (MPP) architecture to parallelize queries across
multiple nodes for faster performance.
3. Autonomous Features: ADW leverages AI and machine learning to automate various
administrative tasks, including performance tuning, security patching, backups, and
fault detection.

How to Install Oracle Autonomous Data Warehouse?


To use Oracle Autonomous Data Warehouse:
1. Sign up for Oracle Cloud: Go to the Oracle Cloud website and sign up for an Oracle Cloud
account.


2. Provision Autonomous Data Warehouse: In the Oracle Cloud Console, provision an
Autonomous Data Warehouse instance.
3. Connect to Autonomous Data Warehouse: Use SQL clients or tools to connect to ADW and
run SQL queries.
4. Load Data into ADW: Load your data into ADW from various sources using Oracle Data Pump,
SQL Developer, or other data loading tools.
5. Run Queries and Analyze Data: Write SQL queries to analyze your data and gain insights.
6. Monitor Performance: Use the Oracle Cloud Console to monitor query performance and
resource utilization.
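
Step 3 might look like the following sketch, assuming the python-oracledb driver and a downloaded connection wallet; the DSN mydb_low, the paths, the passwords, and the sales table being queried are all placeholders, not values from the text:

# Connect to an ADW instance with python-oracledb and run a query.
import oracledb

connection = oracledb.connect(
    user="ADMIN",
    password="your_admin_password",          # placeholder
    dsn="mydb_low",                          # TNS alias from the wallet's tnsnames.ora
    config_dir="/path/to/wallet",            # directory of the unzipped wallet
    wallet_location="/path/to/wallet",
    wallet_password="your_wallet_password",  # placeholder
)

with connection.cursor() as cursor:
    # Example analytical query against a hypothetical SALES table.
    cursor.execute("SELECT channel, SUM(amount_sold) FROM sales GROUP BY channel")
    for channel, total in cursor:
        print(channel, total)

connection.close()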

1.7 Autonomous Data Warehouse vs. Snowflake


What Is Snowflake?
Snowflake is a Data Warehouse built for the cloud. It centralizes data from multiple
sources, enabling you to run in-depth business insights that power your teams.
At its core, Snowflake is designed to handle structured and semi-structured data from
various sources, allowing organizations to integrate and analyze data from diverse systems seamlessly. Its
unique architecture separates compute and storage, enabling users to scale each independently
based on their specific needs. This elasticity ensures optimal resource allocation and cost-efficiency, as
users only pay for the actual compute and storage utilized.
Snowflake uses a SQL-based query language, making it accessible to data analysts and
SQL developers. Its intuitive interface and user-friendly features allow for efficient data
exploration, transformation, and analysis. Additionally, Snowflake provides robust security and
compliance features, ensuring data privacy and protection.
One of Snowflake’s notable strengths is its ability to handle large-scale, concurrent
workloads without performance degradation. Its auto-scaling capabilities automatically adjust resources
based on the workload demands, eliminating the need for manual tuning and optimization.
Another key advantage of Snowflake is its native integration with popular data processing
and analytics tools, such as Apache Spark, Python, and R. This compatibility enables seamless
data integration, data engineering, and advanced analytics workflows.
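As an illustration of the compute/storage separation described above, here is a minimal sketch using the snowflake-connector-python package; the account identifier, warehouse, database, and table names are all placeholders invented for the example.

import snowflake.connector  # pip install snowflake-connector-python

conn = snowflake.connector.connect(
    user="ANALYST",                 # placeholder credentials
    password="your-password",
    account="your-account-id",      # placeholder account identifier
    warehouse="ANALYTICS_WH",
    database="SALES_DB",
    schema="PUBLIC",
)

cur = conn.cursor()
# Compute is decoupled from storage: resizing the virtual warehouse changes
# compute capacity without moving or copying any stored data.
cur.execute("ALTER WAREHOUSE ANALYTICS_WH SET WAREHOUSE_SIZE = 'LARGE'")
cur.execute("SELECT region, COUNT(*) FROM orders GROUP BY region")
for row in cur:
    print(row)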


Snowflake Data Cloud: Snowflake’s unique multi-cluster shared data architecture separates
compute and storage resources, enabling independent scaling of each component. This decoupled
architecture allows organizations to allocate resources based on their needs and budget, providing greater
flexibility and cost control. Additionally, Snowflake’s architecture supports near-infinite
concurrency, enabling multiple users and applications to access the same data simultaneously.


Main Differences

o Platform: Oracle Autonomous Data Warehouse runs on Oracle Cloud Infrastructure and is built on the Oracle Database/Exadata stack; Snowflake is cloud-agnostic and runs on AWS, Microsoft Azure, and Google Cloud.
o Focus of automation: ADW emphasizes self-driving administration (automated tuning, patching, backups, and security); Snowflake emphasizes elastic, independent scaling of compute and storage.
o Ecosystem: ADW fits naturally into organizations already invested in Oracle tools; Snowflake targets mixed, multi-cloud environments.

Technical Differences

o Query language and engine: ADW uses Oracle SQL and PL/SQL on an Exadata-optimized engine; Snowflake uses ANSI SQL on its own engine over columnar micro-partition storage.
o Parallelism and workload isolation: ADW parallelizes queries across Exadata compute nodes (MPP); Snowflake isolates workloads in independently sized virtual warehouses over shared data.
o Pricing: both are consumption-based; ADW bills compute and storage separately on OCI, while Snowflake bills per-second compute credits per virtual warehouse plus storage.

1.8 Modern Data Warehouse


A Modern Data Warehouse is a cloud-based solution that gathers and stores data from across the organization. Organizations can process this data to make intelligent decisions. That's why various organizations use a Modern Data Warehouse to improve business processes in finance, human resources, and operations. Departments need this quality, cloud-based information to make smarter decisions.

There are five different components of a Modern Data Warehouse.


Level 1: Data Acquisition
Data acquisition can come from a variety of sources such as:
 IoT devices
 Social media posts
 YouTube videos
 Website content
 Customer data
 Enterprise Resource Planning
 Legacy data stores

Level 2: Data Engineering


Once you have acquired the data, you need to upload it into the data warehouse. Data engineering uses pipelines and ETL (extract, transform, load) tools to move that information into the warehouse. If the warehouse is a factory, data engineering is the truck bringing raw materials into it.

Level 3: Data Management and Governance


Once the data comes into the factory, you need someone to evaluate the quality of the data. You
then need to steward that data because security and privacy must be considered.
Data governance helps ensure the quality of the info by stewarding, prepping, and cleaning
the data to ensure it is ready for analysis.

Level 4: Reporting and Business Intelligence


Once you prep and clean the data, the factory's analysis stage can turn that raw material (data) into a finished good (business intelligence). For our purposes, we will use Microsoft Power BI to help you visualize the information using advanced analytics, KPIs, and workflow automation. When you are finished, you can see exactly what's going on with your data.


Level 5: Data Science


A Modern Data Warehouse is about more than seeing the information; it's about using the data to make smarter decisions. That's one of the key concepts you should walk away with here today. There
are several different programs to help you leverage the data to your benefit, including:
 AI
 Deep learning
 Machine learning
 Statistical modeling
 Natural language processing (NLP)

Benefits of a modern data warehouse


By leveraging the flexibility and scalability of the cloud, organizations can enjoy greater flexibility and larger workloads without spending time and money maintaining physical, on-premises data centers.


Migrating to a modern data warehouse


Modern data warehouses are just as easy — if not easier — to migrate to as they are to
use. However, to choose the right solution, you'll need to consider your needs, goals, and processes, and to select the right architectures and integrations.


UNIT II ETL AND OLAP TECHNOLOGY 6


What is ETL – ETL Vs ELT – Types of Data warehouses - Data warehouse Design and
Modeling - Delivery Process - Online Analytical Processing (OLAP) - Characteristics of OLAP - Online
Transaction Processing (OLTP) Vs OLAP - OLAP operations- Types of OLAP- ROLAP Vs MOLAP Vs
HOLAP

2.1 What is ETL


ETL stands for Extract, Transform, Load and it is a process used in data warehousing to extract data
from various sources, transform it into a format suitable for loading into a data warehouse, and then load
it into the warehouse. The process of ETL can be broken down into the following three stages:

1. Extract: The first stage in the ETL process is to extract data from various sources such
as transactional systems, spreadsheets, and flat files. This step involves reading data from the
source systems and storing it in a staging area.
2. Transform: In this stage, the extracted data is transformed into a format that is suitable
for loading into the data warehouse. This may involve cleaning and validating the data,
converting data types, combining data from multiple sources, and creating new data fields.
3. Load: After the data is transformed, it is loaded into the data warehouse. This step
involves creating the physical data structures and loading the data into the warehouse.

The ETL process is an iterative process that is repeated as new data is added to the warehouse. The
process is important because it ensures that the data in the data warehouse is accurate, complete, and up-
to-date. It also helps to ensure that the data is in the format required for data mining and reporting.

Additionally, there are many different ETL tools and technologies available, such as Informatica, Talend, DataStage, and others, that can automate and simplify the ETL process.


Let us understand each step of the ETL process in-depth:


1. Extraction:
The first step of the ETL process is extraction. In this step, data is extracted from various source systems, which can be in various formats such as relational databases, NoSQL, XML, and flat files, into the staging area. It is important to extract the data from the various source systems and store it in the staging area first, rather than directly in the data warehouse, because the extracted data is in various formats and can also be corrupted. Loading it directly into the data warehouse may damage it, and rollback would then be much more difficult. Therefore, this is one of the most important steps of the ETL process.
2. Transformation:
The second step of the ETL process is transformation. In this step, a set of rules or functions is applied to the extracted data to convert it into a single standard format. It may involve the following processes/tasks:
 Filtering – loading only certain attributes into the data warehouse.
 Cleaning – filling up the NULL values with some default values, mapping U.S.A, United
States, and America into USA, etc.
 Joining – joining multiple attributes into one.
 Splitting – splitting a single attribute into multiple attributes.
 Sorting – sorting tuples on the basis of some attribute (generally key-attribute).
3. Loading:
The third and final step of the ETL process is loading. In this step, the transformed data is finally loaded into the data warehouse. Sometimes the data is loaded into the data warehouse very frequently, and sometimes it is loaded after longer but regular intervals. The rate and period of loading depend solely on the requirements and vary from system to system.
The ETL process can also use the pipelining concept, i.e., as soon as some data is extracted, it can be transformed, and during that period some new data can be extracted. And while the transformed data is being loaded into the data warehouse, the already extracted data can be transformed.
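To make the three stages concrete, here is a small sketch in Python using only the standard library; the file name, column names, and cleaning rules (the USA mapping from the transformation list above) are invented for illustration, with SQLite standing in for the warehouse.

import csv
import sqlite3

# Extract: read raw rows from a flat-file source into a staging list.
with open("sales_export.csv", newline="") as f:   # hypothetical source file
    staged = list(csv.DictReader(f))

# Transform: clean and standardize, mirroring the cleaning example above
# ("U.S.A", "United States", and "America" all map to "USA").
country_map = {"U.S.A": "USA", "United States": "USA", "America": "USA"}
rows = []
for r in staged:
    country = country_map.get(r["country"], r["country"])
    amount = float(r["amount"] or 0)              # fill NULLs with a default
    rows.append((r["order_id"], country, amount))

# Load: write the transformed rows into the warehouse table.
dw = sqlite3.connect("warehouse.db")              # stand-in for the real DW
dw.execute("CREATE TABLE IF NOT EXISTS sales "
           "(order_id TEXT, country TEXT, amount REAL)")
dw.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
dw.commit()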


ETL Tools: The most commonly used ETL tools are Hevo, Sybase, Oracle Warehouse Builder, CloverETL, and MarkLogic.
Data Warehouses: Most commonly used Data Warehouses are Snowflake, Redshift, BigQuery, and
Firebolt.

Advantages of ETL process in data warehousing:


1. Improved data quality: ETL process ensures that the data in the data warehouse is
accurate, complete, and up-to-date.
2. Better data integration: ETL process helps to integrate data from multiple sources and systems,
making it more accessible and usable.
3. Increased data security: ETL process can help to improve data security by controlling access to
the data warehouse and ensuring that only authorized users can access the data.
4. Improved scalability: ETL process can help to improve scalability by providing a way to
manage and analyze large amounts of data.
5. Increased automation: ETL tools and technologies can automate and simplify the ETL process,
reducing the time and effort required to load and update data in the warehouse.
Disadvantages of ETL process in data warehousing:
1. High cost: ETL process can be expensive to implement and maintain, especially for organizations
with limited resources.
2. Complexity: ETL process can be complex and difficult to implement, especially for organizations
that lack the necessary expertise or resources.
3. Limited flexibility: ETL process can be limited in terms of flexibility, as it may not be able to
handle unstructured data or real-time data streams.
4. Limited scalability: ETL process can be limited in terms of scalability, as it may not be able to
handle very large amounts of data.
5. Data privacy concerns: ETL process can raise concerns about data privacy, as large amounts of
data are collected, stored, and analyzed.

Overall, the ETL process is an essential process in data warehousing that helps ensure that the data in the data warehouse is accurate, complete, and up-to-date. However, it also comes with its own set of challenges and limitations, and organizations need to carefully consider the costs and benefits before implementing it.


2.2 ETL Vs ELT

ELT vs ETL:

o Hardware: ELT tools do not require additional hardware; ETL tools require specific hardware with their own engines to perform transformations.
o Storage: ELT mostly uses Hadoop or a NoSQL database to store data (rarely an RDBMS); ETL uses an RDBMS exclusively to store data.
o Loading: In ELT, as all components are in one system, loading is done only once; in ETL, because a staging area is used, extra time is required to load the data.
o Transformation time: In ELT, the time to transform data is independent of the size of the data; in ETL, the system has to wait for large sizes of data, and as the size of the data increases, transformation time also increases.
o Cost: ELT is cost-effective and available to all businesses using SaaS solutions; ETL is not cost-effective for small and medium businesses.
o Users: Data transformed with ELT is used by data scientists and advanced analysts; data transformed with ETL is used by users reading reports and by SQL coders.
o Views: ELT creates ad hoc views, with low cost for building and maintaining; in ETL, views are created based on multiple scripts, and deleting a view means deleting data.
o Data types: ELT is best for unstructured and non-relational data, is ideal for data lakes, and is suited for very large amounts of data; ETL is best for relational and structured data and is better for small to medium amounts of data.
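The contrast in the table is easiest to see in code. In ELT, raw records land in the target system first and the transformation then runs inside it using its own SQL engine; the sketch below is illustrative only, with SQLite standing in for the warehouse and the table and column names invented.

import sqlite3

dw = sqlite3.connect("warehouse.db")  # stand-in for the target system

# Load: raw records land in the warehouse untouched.
dw.execute("CREATE TABLE IF NOT EXISTS raw_sales "
           "(order_id TEXT, country TEXT, amount TEXT)")
dw.executemany(
    "INSERT INTO raw_sales VALUES (?, ?, ?)",
    [("1001", "U.S.A", "250.00"), ("1002", "United States", "99.50")],
)

# Transform: performed after loading, inside the target system's SQL engine.
dw.execute("DROP TABLE IF EXISTS clean_sales")
dw.execute("""
    CREATE TABLE clean_sales AS
    SELECT order_id,
           CASE WHEN country IN ('U.S.A', 'United States', 'America')
                THEN 'USA' ELSE country END AS country,
           CAST(amount AS REAL) AS amount
    FROM raw_sales
""")
dw.commit()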



2.3 Types of Data Warehouses
There are three main types of data warehouse.
Enterprise Data Warehouse (EDW)
This type of warehouse serves as a key or central database that facilitates decision-support
services throughout the enterprise. The advantage to this type of warehouse is that it provides access to
cross-organizational information, offers a unified approach to data representation, and allows
running complex queries.
Operational Data Store (ODS)
This type of data warehouse refreshes in real-time. It is often preferred for routine activities like
storing employee records. It is required when data warehouse systems do not support reporting needs of
the business.
Data Mart
A data mart is a subset of a data warehouse built to maintain a particular department, region,
or business unit. Every department of a business has a central repository or data mart to store data. The
data from the data mart is stored in the ODS periodically. The ODS then sends the data to the EDW,
where it is stored and used.

There are different types of data warehouses, which are as follows:

1.Host-Based Data Warehouses


There are two types of host-based data warehouses which can be implemented:
o Host-Based mainframe warehouses which reside on a high volume database. Supported by robust
and reliable high capacity structure such as IBM system/390, UNISYS and Data General sequent
systems, and databases such as Sybase, Oracle, Informix, and DB2.
o Host-Based LAN data warehouses, where data delivery can be handled either centrally or
from the workgroup environment. The size of the data warehouses of the database depends
on the platform.

Data extraction and transformation tools allow the automated extraction and cleaning of data from production systems. It is not advisable to enable direct access by query tools to these categories of systems, for the following reasons:
1. A huge load of complex warehousing queries would possibly have too much of a harmful impact
upon the mission-critical transaction processing (TP)-oriented application.
2. These TP systems have had their database designs evolved for transaction throughput. In either case, a database is designed for either optimal query processing or optimal transaction processing. A complex business query requires the joining of many normalized tables, and as a result performance will usually be poor and the query constructs largely complex.
3. There is no assurance that data in two or more production methods will be consistent.

Host-Based (MVS) Data Warehouses


Data warehouses that reside on large-volume databases on MVS are the host-based type of data warehouse. Often the DBMS is DB2, with a huge variety of original sources for legacy information, including VSAM, DB2, flat files, and Information Management System (IMS). Before embarking on designing, building, and implementing such a warehouse, some further considerations must be given, because:
1. Such databases generally have very high volumes of data storage.
2. Such warehouses may require support for both MVS and customer-based report and query
facilities.
3. These warehouses have complicated source systems.
4. Such systems needed continuous maintenance since these must also be used for mission-critical
objectives.
To make such data warehouses building successful, the following phases are generally followed:
1. Unload Phase: It contains selecting and scrubbing the operation data.
2. Transform Phase: For translating it into an appropriate form and describing the rules for
accessing and storing it.
3. Load Phase: For moving the record directly into DB2 tables or a particular file for moving it into
another database or non-MVS warehouse.
An integrated Metadata repository is central to any data warehouse environment. Such a facility is
required for documenting data sources, data translation rules, and user areas to the warehouse.
It provides a dynamic network between the multiple data source databases and the DB2 of the
conditional data warehouses.
A metadata repository is necessary to design, build, and maintain data warehouse processes. It should be capable of providing data on what exists in both the operational system and the data warehouse, where the data is located, the mapping of the operational data to the warehouse fields, and the end-user access techniques. Query, reporting, and maintenance facilities are other indispensable components of such a data warehouse, for example an MVS-based query and reporting tool for DB2.

Host-Based (UNIX) Data Warehouses


Oracle and Informix RDBMSs support the facilities for such data warehouses. Both of these databases can extract information from MVS-based databases as well as from a large number of other UNIX-based databases. These types of warehouses follow the same stages as the host-based MVS data warehouses. Also, data from different network servers can be included, since file attribute consistency is frequent across the inter-network.

2.LAN-Based Workgroup Data Warehouses


A LAN based workgroup warehouse is an integrated structure for building and maintaining a data
warehouse in a LAN environment. In this warehouse, we can extract information from a variety of
sources and support multiple LAN based warehouses, generally chosen warehouse databases to
include DB2 family, Oracle, Sybase, and Informix. Other databases that can also be contained
through infrequently are IMS, VSAM, Flat File, MVS, and VH.

Designed for the workgroup environment, a LAN based workgroup warehouse is optimal for any
business organization that wants to build a data warehouse often called a data mart. This type of data
warehouse generally requires a minimal initial investment and technical training.
Data Delivery: With a LAN-based workgroup warehouse, the customer needs minimal technical knowledge to create and maintain a store of data that is customized for use at the department, business unit, or workgroup level. A LAN-based workgroup warehouse ensures the delivery of information from corporate resources by providing transparent access to the data in the warehouse.

3.Host-Based Single Stage (LAN) Data Warehouses


Within a LAN-based data warehouse, data delivery can be handled either centrally or from the workgroup environment, so business groups can process their data needs without burdening centralized IT resources, enjoying the autonomy of their data mart without compromising overall data integrity and security in the enterprise.
Limitations
Both DBMS and hardware scalability methods generally limit LAN based warehousing solutions.
Many LAN based enterprises have not implemented adequate job scheduling, recovery management,
organized maintenance, and performance monitoring methods to provide robust
warehousing solutions.
Often these warehouses are dependent on other platforms for source record. Building an
environment that has data integrity, recoverability, and security require careful design, planning, and
implementation. Otherwise, synchronization of transformation and loads from sources to the server
could cause innumerable problems.

A LAN based warehouse provides data from many sources requiring a minimal
initial investment and technical knowledge. A LAN based warehouse can also work replication
tools for populating and updating the data warehouse. This type of warehouse can include
business views, histories, aggregation, versions in, and heterogeneous source support, such as
o DB2 Family
o IMS, VSAM, Flat File [MVS and VM]

A single store frequently drives a LAN based warehouse and provides existing DSS applications,
enabling the business user to locate data in their data warehouse. The LAN based warehouse
can support business users with complete data to information solution. The LAN based
warehouse can also share metadata with the ability to catalog business data and make it
feasible for anyone who needs it.

4.Multi-Stage Data Warehouses


It refers to multiple stages in the transformation methods for analyzing data through aggregations. In other words, the data is staged multiple times before the loading operation into the data warehouse: data gets extracted from source systems to a staging area first, then gets loaded into the data warehouse after transformation, and then finally into departmentalized data marts.
This configuration is well suited to environments where end-clients in numerous capacities require access both to summarized data for up-to-the-minute tactical decisions and to cumulative records for long-term strategic decisions. Both the Operational Data Store (ODS) and the data warehouse may reside on host-based or LAN-based databases, depending on volume and custom requirements. These include DB2, Oracle, Informix, IMS, flat files, and Sybase. Usually, the ODS stores only the most up-to-date records, while the data warehouse stores the historical accumulation of the files. At first, the information in both databases will be very similar; for example, the records for a new client will look the same. As changes to the user record occur, the ODS will be refreshed to reflect only the most current data, whereas the data warehouse will contain both the historical data and the new information. Thus the volume requirement of the data warehouse will exceed the volume requirements of the ODS over time. It is not unusual to reach a ratio of 4 to 1 in practice.

5.Stationary Data Warehouses


In this type of data warehouse, the data is not moved from the sources. Instead, the customer is given direct access to the data. For many organizations, infrequent access, volume issues, or corporate necessities dictate such an approach. This scheme does generate several problems for the customer, such as:
o Identifying the location of the information for the users
o Providing clients the ability to query different DBMSs as is they were all a single DBMS with a
single API.
o Impacting performance since the customer will be competing with the production data stores.
Such a warehouse will need highly specialized and sophisticated 'middleware', possibly with a single point of interaction with the client. A facility to display the extracted records to the user before report generation may also be essential. An integrated metadata repository becomes absolutely essential in this environment.

6.Distributed Data Warehouses


The concept of a distributed data warehouse suggests that there are two types of distributed data warehouses: local enterprise warehouses, which are distributed throughout the enterprise, and a global warehouse.

Characteristics of Local data warehouses


o Activity appears at the local level
o Bulk of the operational processing
o Local site is autonomous
o Each local data warehouse has its unique architecture and contents of data
o The data is unique and of prime essential to that locality only
o Majority of the record is local and not replicated
o Any intersection of data between local data warehouses is circumstantial
o Local warehouse serves different technical communities
o The scope of the local data warehouses is finite to the local site
o Local warehouses also include historical data and are integrated only within the local site.

7.Virtual Data Warehouses


Virtual Data Warehouses is created in the following stages:
1. Installing a set of data approach, data dictionary, and process management facilities.
2. Training end-clients.
3. Monitoring how DW facilities will be used
4. Based upon actual usage, a physical data warehouse is created to provide the high-frequency results.

In this strategy, end users are allowed to access operational databases directly, using whatever tools are wired into the data access network. This method provides ultimate flexibility as well as the minimum amount of redundant information that must be loaded and maintained. The data warehouse is a great idea, but it is difficult to build and requires investment. Why not use a cheap and fast method that eliminates the transformation phase and the repositories for metadata and another database? This method is termed the 'virtual data warehouse.'

To accomplish this, there is a need to define four kinds of data:


1. A data dictionary including the definitions of the various databases.
2. A description of the relationship between the data components.
3. The description of the method user will interface with the system.
4. The algorithms and business rules that describe what to do and how to do it.

Disadvantages
1. Since queries compete with production record transactions, performance can be degraded.
2. There is no metadata, no summary record, or no individual DSS (Decision Support System)
integration or history. All queries must be copied, causing an additional burden on the system.
3. There is no refreshing process, causing the queries to be very complex.
2.4 Data warehouse Design and Modeling
Data Warehouse Design
A data warehouse is a single data repository where a record from multiple data sources is
integrated for online business analytical processing (OLAP). This implies a data warehouse needs to meet
the requirements from all the business stages within the entire organization. Thus, data warehouse design
is a hugely complex, lengthy, and hence error-prone process. Furthermore, business analytical functions
change over time, which results in changes in the requirements for the systems. Therefore, data
warehouse and OLAP systems are dynamic, and the design process is continuous.
Data warehouse design takes a different approach from view materialization in industry. It sees data warehouses as database systems with particular needs, such as answering management-related queries. The target of the design becomes how the records from multiple data sources should be extracted, transformed, and loaded (ETL) to be organized in a database as the data warehouse.
There are two approaches
1. "top-down" approach
2. "bottom-up" approach

Top-down Design Approach


In the "Top-Down" design approach, a data warehouse is described as a subject-oriented, time-
variant, non-volatile and integrated data repository for the entire enterprise data from different sources are
validated, reformatted and saved in a normalized (up to 3NF) database as the data warehouse. The data
warehouse stores "atomic" information, the data at the lowest level of granularity, from
where dimensional data marts can be built by selecting the data required for specific business
subjects or particular departments. An approach is a data-driven approach as the information is
gathered and integrated first and then business requirements by subjects for building data marts are
formulated. The advantage of this method is which it supports a single integrated data source. Thus data
marts built from it will have consistency when they overlap.
Advantages of top-down design
Data Marts are loaded from the data warehouses.
Developing new data mart from the data warehouse is very easy.
Disadvantages of top-down design
This technique is inflexible to changing departmental needs.
The cost of implementing the project is high.

Bottom-Up Design Approach


In the "Bottom-Up" approach, a data warehouse is described as "a copy of transaction
data specifically architecture for query and analysis," term the star schema. In this approach, a data
mart is
created first to necessary reporting and analytical capabilities for particular business processes (or
subjects). Thus it is needed to be a business-driven approach in contrast to Inmon's data-driven approach.
Data marts include the lowest grain data and, if needed, aggregated data too. Instead of a
normalized database for the data warehouse, a denormalized dimensional database is adapted to meet the
data delivery requirements of data warehouses. Using this method, to use the set of data marts as
the enterprise data warehouse, data marts should be built with conformed dimensions in mind, defining
that ordinary objects are represented the same in different data marts. The conformed dimensions
connected the data marts to form a data warehouse, which is generally called a virtual data warehouse.

The advantage of the "bottom-up" design approach is that it has quick ROI, as developing a data mart, a
data warehouse for a single subject, takes far less time and effort than developing an enterprise-wide data
warehouse. Also, the risk of failure is even less. This method is inherently incremental. This method
allows the project team to learn and grow.

Advantages of bottom-up design


Documents can be generated quickly.
The data warehouse can be extended to accommodate new business units.
It is just developing new data marts and then integrating with other data marts.
Disadvantages of bottom-up design
The locations of the data warehouse and the data marts are reversed in the bottom-up approach
design.

Differentiate between Top-Down Design Approach and Bottom-Up Design Approach

o Problem decomposition: Top-down breaks the vast problem into smaller sub-problems; bottom-up solves the essential low-level problems and integrates them into a higher one.
o Nature: Top-down is inherently architected, not a union of several data marts; bottom-up is inherently incremental and can schedule essential data marts first.
o Storage: Top-down uses a single, central store of information about the content; bottom-up stores information departmentally.
o Control: Top-down has centralized rules and control; bottom-up has departmental rules and control.
o Redundancy: Top-down includes redundant information; in bottom-up, redundancy can be removed.
o Risk and results: Top-down may see quick results if implemented with repetitions; bottom-up has less risk of failure, a favorable return on investment, and proof of techniques.

8 steps to data warehouse design:


1. Gather Requirements: Aligning the business goals and needs of different departments with the
overall data warehouse project.
2. Set Up Environments: This step is about creating three environments for data warehouse development, testing, and production, each running on separate servers.
3. Data Modeling: Design the data warehouse schema, including the fact tables and
dimension tables, to support the business requirements.
4. Develop Your ETL Process: ETL stands for Extract, Transform, and Load. This process is how
data gets moved from its source into your warehouse.
5. OLAP Cube Design: Design OLAP cubes to support analysis and reporting requirements.
6. Reporting & Analysis: Developing and deploying the reporting and analytics tools that will be
used to extract insights and knowledge from the data warehouse.
7. Optimize Queries: Optimizing queries ensures that the system can handle large amounts of data
and respond quickly to queries.
8. Establish a Rollout Plan: Determine how the data warehouse will be introduced to the
organization, which groups or individuals will have access to it, and how the data will be
presented to these users.

1. Defining Business Requirements (or Requirements Gathering)

Data warehouse design is a business-wide journey. Data warehouses touch all areas of
your business, so every department needs to be on board with the design. Since your warehouse is
only as powerful as the data it contains, aligning departmental needs and goals with the overall project is
critical to your success.

So, if you currently can't combine all your sales data with all your marketing data, your overall
query results are missing some critical components. Knowing which leads are valuable can help you get
more value from your marketing data.

Every department needs to understand the purpose of the data warehouse, how it benefits them,
and what kinds of results they can expect from your warehousing solution.

This Requirements Gathering stage should focus on the following objectives:

 Aligning departmental goals with the overall project

 Determining the scope of the project in relation to business processes

 Discovering your current and future needs by diving deep into your data (finding out what data is
useful for analysis) and your current tech stack (where your data is currently siloed and not being
used)

 Creating a disaster recovery plan in the case of system failure


 Thinking about each layer of security (e.g., threat detection, threat mitigation, identity
controls, monitoring, risk reduction, etc.)
 Anticipating compliance needs and mitigating regulatory risks

You can think of this as your overall data warehouse blueprint. But this phase is more about determining
your business needs, aligning those to your data warehouse, and, most importantly, getting everyone on
board with the data warehousing solution.

2. Setting Up Your Physical Environments

Data warehouses typically have three primary physical environments — development, testing, and
production. This mimics standard software development best practices, and your three environments exist
on completely separate physical servers.
Why do you need three separate environments?

 You need a way to test changes before they move into the production environment.

 Some security best practices require that testers and developers never have access to production
data.

 Running tests against data typically uses extreme data sets or random sets of data from
the production environment — and you need a unique server to execute these tests en masse.

 Having a development environment is a necessity, and dev environments exist in a unique state of
flux compared to production or test environments.

 Production environments have much higher workloads (your whole business is using it), so trying
to run tests or develop in that environment can be stressful for both team members and servers.

 Data integrity is much easier to track, and issues are easier to contain, when you have three environments running. It makes hunting down issues less stressful for your workloads, and data flow in production and testing environments can be stalled without impacting end users.

 Running tests can often introduce breakpoints and hang your entire server. That's not something
you want happening in your production environment.

 Imagine sharing resources between production, testing, and development. You don’t want
that!
Testing, development, and production environments all have different resource needs, and trying
to combine all functions into one server can be catastrophic for performance.

Remember, BI development is an ongoing process that really never grinds to a halt. This is especially true
in Agile/DevOps approaches to the software development lifecycle, which all require
separate environments due to the sheer magnitude of constant changes and adaptations.

You can choose to run more than these three environments, and some business users choose to
add additional environments for specific business needs. Integrate.io has seen staging environments that
are separate from testing solely for quality assurance work, as well as demo and integration
environments specifically for testing integrations.

You should have these three core environments, but you can layer in additional settings to fit your unique
business goals.
3. Data Warehouse Design: Introducing Data Modeling

Data modeling is the process of visualizing data distribution in your warehouse. Think of it as a blueprint.
Before you start building a house, it's important to know what goes where and why it goes there. That's
what data modeling is to data warehouses.

Data modeling helps you:

 Visualize the relationships between data

 Set standardized naming conventions

 Create relationships between data sets

 Establish compliance and security processes

 Align your processes with your overarching IT goals

The above benefits of data modeling help improve decision-making throughout your organization.

However, data modeling is probably the most complex phase of data warehouse design, and there are multiple data modeling techniques businesses can choose from for warehouse design. Before jumping into a few of the most popular data modeling techniques, let's take a look at the differences between data warehouses and data marts:

A data warehouse is a system to store data in (or push data into) to run analytics and queries. A data mart,
on the other hand, is an area within a data warehouse that stores data for a specific business function.

So, say you've built your entire data warehouse. That's great! But does it account for how
different departments will use the data? Your sales team will use that data warehouse in a vastly different
way than your legal team. Plus, certain workflows and data sets are only valuable to certain teams. Data
marts are where all those team-specific data sets are stored, and related queries are processed.

Data modeling typically takes place at the data mart level and branches out into your data warehouse. It's
the logic behind how you store certain data in relation to other data.

The three most popular data models for warehouses are:

1. Snowflake schema

2. Star schema

3. Galaxy schema
You should choose and develop a data model to guide your overall data architecture within your
warehouse. The model you choose will impact the structure of your data warehouse and data marts —
which impacts the ways that you utilize ETL tools like Integrate.io and run queries on that data.
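For a feel of what the star schema (the first of the three models above) looks like in practice, here is a minimal sketch; the table and column names are invented for illustration, and SQLite is used only so the example is self-contained.

import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
-- Dimension tables hold the descriptive attributes used for slicing.
CREATE TABLE dim_date (
    date_key     INTEGER PRIMARY KEY,
    full_date    TEXT,
    month        INTEGER,
    year         INTEGER
);
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT,
    category     TEXT
);
-- The fact table sits at the center of the star: it stores the measures,
-- with foreign keys pointing out to each dimension.
CREATE TABLE fact_sales (
    date_key     INTEGER REFERENCES dim_date(date_key),
    product_key  INTEGER REFERENCES dim_product(product_key),
    units_sold   INTEGER,
    revenue      REAL
);
""")

A snowflake schema would further normalize these dimensions (for example, splitting category out of dim_product into its own table), while a galaxy schema would add more fact tables that share the same dimensions.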

4. Choosing Your Extract, Transform, Load (ETL) Solution

ETL or Extract, Transform, Load is the process used to pull data out of your current tech stack or
existing storage solutions and put it into your warehouse. It goes something like this:

 You extract data from a source system and place it into a staging area.

 You transform that data into the best format for data analytics. You also remove any duplicated
data or inconsistencies that can make analysis difficult.

 You then load the data to a data warehouse before pushing it through BI tools like Tableau and
Looker.

Normally, ETL is a complicated process that requires manual pipeline-building and lots of code. Building
these pipelines can take weeks or even months and might require a data engineering team. That’s where
ETL solutions come in. They automate many tasks associated with this data management and integration
process, freeing up resources for your team.

You should pay careful attention to the ETL solution you use so you can improve business
decisions. Since ETL is responsible for the bulk of the in-between work, choosing a subpar tool or
developing a poor ETL process can break your entire warehouse. You want optimal speeds, high
availability, good visualization, and the ability to build easy, replicable, and consistent data
pipelines between all your existing architecture and your new warehouse.

This is where ETL tools like Integrate.io are valuable. Integrate.io creates hyper-visualized data pipelines between all your valuable tech architecture while cleaning and normalizing that data for compliance and ease of use.

Remember, a good ETL process can mean the difference between a slow, painful-to-use data warehouse
and a simple, functional warehouse that's valuable throughout every layer of your organization.

ETL will likely be the go-to for pulling data from systems into your warehouse. Its counterpart Extract,
Load, Transfer (ELT) negatively impacts the performance of most custom-built warehouses since data is
loaded directly into the warehouse before data organization and cleansing occur. However, there might be
other data integration use cases that suit the ELT process. Integrate.io not only executes ETL but
can handle ELT, Reverse ETL, and Change Data Capture (CDC), as well as provide data observability
and data warehouse insights.
5. Online Analytic Processing (OLAP) Cube

OLAP (Online Analytical Processing) cubes are commonly used in the data warehousing process
to enable faster, more efficient analysis of large amounts of data. OLAP cubes are based
on multidimensional databases that store summarized data and allow users to quickly analyze
information from different dimensions.

Here's how an OLAP cube fits into the data warehouse design:

 OLAP cubes are designed to store pre-aggregated data that has been processed from
various sources in a data warehouse. The data is organized into a multi-dimensional structure that
enables users to view and analyze it from different perspectives.

 OLAP cubes are created using a process called cube processing, which involves aggregating and
storing data in a way that enables fast retrieval and analysis. Cube processing can be performed on
a regular basis to ensure that the data is up-to-date and accurate.

 OLAP cubes enable users to perform complex analytical queries on large volumes of data in real-
time, making it easier to identify trends, patterns, and anomalies. Users can also slice and
dice data in different ways to gain deeper insights into their business operations.

 OLAP cubes support drill-down and roll-up operations, which allow users to navigate
through different levels of data granularity. Users can drill down to the lowest level of
detail to view individual transactions or roll up to higher levels of aggregation to view summary
data.
 OLAP cubes can be accessed using a variety of tools, including spreadsheets, reporting tools, and
business intelligence platforms. Users can create reports and dashboards that display the data in a
way that is meaningful to them.
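The slice, roll-up, and drill-down operations described above can be approximated on a small scale with pandas pivot tables; the data below is invented toy data, not output from a real OLAP engine.

import pandas as pd

# Toy fact data with three dimensions: year, region, product.
sales = pd.DataFrame({
    "year":    [2023, 2023, 2024, 2024],
    "region":  ["East", "West", "East", "West"],
    "product": ["A", "A", "B", "B"],
    "revenue": [100, 150, 120, 180],
})

# Roll-up: aggregate to a coarser level (total revenue per year).
print(sales.pivot_table(values="revenue", index="year", aggfunc="sum"))

# Drill-down: move to a finer grain (year, then region, then product).
print(sales.pivot_table(values="revenue",
                        index=["year", "region", "product"],
                        aggfunc="sum"))

# Slice: fix one dimension member (the 2024 slice of the cube).
print(sales[sales["year"] == 2024])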

You'll likely need to address OLAP cubes if you're designing your entire database from scratch,
or if you're maintaining your own OLAP cube — which typically requires specialized personnel.

So, if you plan to use a vendor warehouse solution (e.g., Redshift or BigQuery) you probably won't need an OLAP cube (cubes are rarely used in either of those solutions).

If you have a set of BI tools requiring an OLAP cube for ad-hoc reporting, you may need to develop one
or use a vendor solution.

OLAP Cubes vs. Data Warehouse

Here are the differences between a data warehouse and OLAP cubes:
 A data warehouse is where you store your business data in an easily analyzable format to be used
for a variety of business needs.

 Online Analytic Processing cubes help you analyze the data in your data warehouse or data mart.
Most of the time, OLAP cubes are used for reporting, but they have plenty of other use cases.

Since your data warehouse will have data coming in from multiple data pipelines, OLAP cubes help you
organize all that data in a multi-dimensional format that makes analyzing it rapid and
straightforward. OLAP cubes are a critical component of data warehouse design because they provide
fast and efficient access to large volumes of data, enabling users to make informed business decisions
based on insights derived from the data.

6. Data Warehouse Design: Creating the Front End

So far, this guide has only covered back-end processes. There needs to be front-end visualization,
so users can immediately understand and apply the results of data queries.

That's the job of your front end. There are plenty of tools on the market that help with visualization. BI
tools like Tableau (or PowerBI for those using BigQuery) are great for visualization. You can
also develop a custom solution — though that's a significant undertaking.

Most small-to-medium-sized businesses lean on established BI kits like those mentioned above.
But, some businesses may need to develop their own BI tools to meet ad-hoc analytic needs. For example,
a Sales Ops manager at a large company may need a specific BI tool for territory strategies.
This tool would probably be custom-developed given the scope of the company’s sales objectives.

7. Optimizing Queries

Optimizing queries is a critical part of data warehouse design. One of the primary goals
of building a data warehouse is to provide fast and efficient access to data for decision-making. During
the design process, data architects need to consider the types of queries that users will be running and
design the data warehouse schema and indexing accordingly.

Optimizing your queries is a complex process that's hyper-unique to your specific needs. But there
are some general rules of thumb.

We heavily recommend the following during database design:

 Ensure your production, testing, and development environments have mirrored resources. This mirroring prevents the server from hanging when you push projects from one environment to the next.
 Try to minimize data retrieval. Don't run SELECT on the whole database if you only need
a column of results. Instead, run your SELECT query by targeting specific columns. This
is especially important if you're paying for your query power separately.

 Understand the limitations of your OLAP vendor. BigQuery uses a hybrid SQL language, and Redshift is built on top of a PostgreSQL fork. Knowing the little nuances baked into your vendor can help you maximize workflows and speed up queries.
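As a small illustration of the "minimize data retrieval" rule, the sketch below targets specific columns and adds an index on the filter column; it reuses the hypothetical sales table from the earlier ETL sketch, and EXPLAIN QUERY PLAN is SQLite-specific (each vendor has its own equivalent).

import sqlite3

db = sqlite3.connect("warehouse.db")

# Minimize retrieval: ask only for the columns you need, not SELECT *.
rows = db.execute(
    "SELECT country, amount FROM sales WHERE country = 'USA'").fetchall()

# An index on the filter column lets the engine avoid a full table scan.
db.execute("CREATE INDEX IF NOT EXISTS idx_sales_country ON sales(country)")
for step in db.execute("EXPLAIN QUERY PLAN "
                       "SELECT country, amount FROM sales "
                       "WHERE country = 'USA'"):
    print(step)  # should now report a search using idx_sales_country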

8. Establishing a Rollout Plan


Once you're ready to launch your warehouse, it's time to start thinking about education, training,
and use cases. Most of the time, it will be a week or two before your end-users start seeing
any functionality from that warehouse (at least at scale). But they should be adequately trained in
its use before the rollout is completed.
A rollout plan typically includes the following steps:

1. Identifying the target audience: This involves determining which groups or individuals within
the organization will benefit from using the data warehouse.

2. Determining the data requirements: This involves identifying the types of data that the target
audience needs access to and ensuring that this data is available within the data warehouse.

3. Developing user-friendly interfaces: This involves creating user interfaces that are intuitive and
easy to use, and that provide users with the ability to interact with the data in meaningful ways.

4. Testing and refining: This involves conducting user testing to ensure that the data
warehouse meets the needs of its users, and making adjustments as necessary.

5. Training users: This involves providing training and support to users to help them
understand how to use the data warehouse effectively.

6. Deploying the data warehouse: This involves introducing the data warehouse to its
intended users, and ensuring that the rollout process goes smoothly.

By establishing a rollout plan, organizations can ensure that their data warehouse is
introduced effectively and that users are able to make the most of the valuable data that it contains.

2.4.1 Modeling
Data warehouse modeling is the process of designing the schemas of the detailed and summarized information of the data warehouse. The goal of data warehouse modeling is to develop a schema describing the reality, or at least a part of it, that the data warehouse is required to support.
Data warehouse modeling is an essential stage of building a data warehouse for two main reasons.
Firstly, through the schema, data warehouse clients can visualize the relationships among the warehouse
data, to use them with greater ease. Secondly, a well-designed schema allows an effective data warehouse
structure to emerge, to help decrease the cost of implementing the warehouse and improve the efficiency
of using it.
Data modeling in data warehouses is different from data modeling in operational database
systems. The primary function of data warehouses is to support DSS processes. Thus, the objective of
data warehouse modeling is to make the data warehouse efficiently support complex queries on long term
information.
In contrast, data modeling in operational database systems targets efficiently supporting simple
transactions in the database such as retrieving, inserting, deleting, and changing data. Moreover,
data warehouses are designed for the customer with general information knowledge about the
enterprise, whereas operational database systems are more oriented toward use by software specialists for
creating distinct applications.

The data within the warehouse itself has a particular architecture, with an emphasis on various levels of summarization. The current detail record is central in importance, as it:

o Reflects the most current happenings, which are commonly the most interesting.
o Is voluminous, as it is saved at the lowest level of granularity.
o Is almost always saved on disk storage, which is fast to access but expensive and difficult to manage.

Older detail data is stored on some form of mass storage; it is infrequently accessed and kept at a level of detail consistent with current detailed data.
Lightly summarized data is data extracted from the low level of detail found at the current, detailed level and is usually stored on disk storage. When building the data warehouse, one has to decide what unit of time the summarization is done over, and what components or attributes the summarized data will contain. Highly summarized data is compact and directly available and can even be found outside the warehouse. Metadata is the final element of the data warehouse. It is not the same as the data drawn from the operational environment; rather, it is used as:
o A directory to help the DSS investigator locate the items of the data warehouse.
o A guide to the mapping of records as the data is changed from the operational environment to the data warehouse environment.
o A guide to the methods used for summarization between the current, accurate data and the lightly summarized information and the highly summarized data, etc.
Data Modeling Life Cycle
In this section, we define a data modeling life cycle. It is a straightforward process of transforming the business requirements to fulfill the goals for storing, maintaining, and accessing the data within IT systems. The result is a logical and physical data model for an enterprise data warehouse.
The objective of the data modeling life cycle is primarily the creation of a storage area for business information. That area comes from the logical and physical data modeling stages.
Conceptual Data Model
A conceptual data model recognizes the highest-level relationships between the different entities.
Characteristics of the conceptual data model
o It contains the essential entities and the relationships among them.
o No attribute is specified.
o No primary key is specified.
The only things shown via the conceptual data model are the entities that define the data and the relationships between those entities; no other detail is shown.

Logical Data Model


A logical data model describes the information in as much structure as possible, without regard to how it will be physically implemented in the database. The primary objective of logical data modeling is to document the business data structures, processes, rules, and relationships in a single view - the logical data model.

Features of a logical data model


o It involves all entities and relationships among them.
o All attributes for each entity are specified.
o The primary key for each entity is stated.
o Referential Integrity is specified (FK Relation).

The phases for designing the logical data model are as follows:
o Specify primary keys for all entities.
o List the relationships between different entities.
o List all attributes for each entity.
o Normalization.
o No data types are listed

Physical Data Model


A physical data model describes how the model will be implemented in the database. A physical database model shows all table structures, column names, data types, constraints, primary keys, foreign keys, and relationships between tables. The purpose of physical data modeling is the mapping of the logical data model to the physical structures of the RDBMS system hosting the data warehouse. This includes defining physical RDBMS structures, such as the tables and data types to use when storing the information. It may also include the definition of new data structures for enhancing query performance.

Characteristics of a physical data model


o Specification of all tables and columns.
o Foreign keys are used to recognize relationships between tables.
The steps for physical data model design which are as follows:
o Convert entities to tables.
o Convert relationships to foreign keys.
o Convert attributes to columns.
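The three conversion steps can be illustrated with a tiny sketch; the Customer and Order entities and their attributes are invented for the example, and SQLite keeps it self-contained.

import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
-- Entities become tables; attributes become typed columns.
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,   -- primary key from the logical model
    name        TEXT,
    city        TEXT
);
-- The Customer-Order relationship becomes a foreign key (FK relation).
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customer(customer_id),
    order_date  TEXT,
    total       REAL
);
""")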
Data warehouse development life cycle
Data Warehousing is a process used to gather and handle structured and unstructured data from multiple sources into a centralized repository in order to drive actionable business decisions. With all of your data in one place, it becomes easier to perform analysis and reporting and to discover meaningful insights at completely different combination levels.
A data warehouse setting includes an extraction, transformation, and loading (ETL) solution, an online analytical processing (OLAP) engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users. The term data warehouse life cycle is used to indicate the steps a data warehouse system goes through from the time it is built. The following is the life cycle of Data Warehousing:
Data Warehouse Life Cycle
 Requirement Specification: It is the first step in the development of the Data Warehouse and is
done by business analysts. In this step, Business Analysts prepare business
requirement specification documents. More than 50% of requirements are collected from the client
side and it takes 3-4 months to collect all the requirements. After the requirements are
gathered, the data modeler starts recognizing the dimensions, facts & combinations based on the
requirements. We can say that this is an overall blueprint of the data warehouse. But, this
phase is more about determining business needs and placing them in the data warehouse.

 Data Modelling: This is the second step in the development of the Data Warehouse. Data
Modelling is the process of visualizing data distribution and designing databases by fulfilling the
requirements to transform the data into a format that can be stored in the data warehouse.
For example, whenever we start building a house, we put all the things in the correct
position as specified in the blueprint. That’s what data modeling is for data warehouses. Data
modelling helps to organize data, creates connections between data sets, and it’s useful for
establishing data compliance and its security that line up with data warehousing goals. It is the
most complex phase of data warehouse development. And, there are many data modelling
techniques that businesses use for warehouse design. Data modelling typically takes place at the
data mart level and branches out in a data warehouse. It’s the logic of how the data is stored
concerning other data. There are three data models for data warehouses:
 Star Schema
 Snowflake Schema
 Galaxy Schema.

 ETL Design and Development: This is the third step in the development of the Data Warehouse. An ETL or Extract, Transform, Load tool may extract data from various source systems and store it in a data lake. An ETL process can extract the data from the lake, then transform it and load it into a data warehouse for reporting. For optimal speeds, good visualization, and the ability to build easy, replicable, and consistent data pipelines between all of the existing architecture and the new data warehouse, we need ETL tools. This is where ETL tools like SAS Data Management, IBM Information Server, Hive, etc. come into the picture. A good ETL process can be helpful in constructing a simple yet functional data warehouse that's valuable throughout every layer of the organization.

 OLAP Cubes: This is the fourth step in the development of the Data Warehouse. An OLAP cube,
also known as a multidimensional cube or hypercube, is a data structure that allows fast analysis
of data according to the multiple dimensions that define a business problem. A data warehouse
would extract information from multiple data sources and formats like text files, excel
sheets, multimedia files, etc. The extracted data is cleaned and transformed and is loaded into an
OLAP server (or OLAP cube) where information is pre-processed in advance for further
analysis. Usually, data operations and analysis are performed using a simple spreadsheet, where
data values are arranged in row and column format. This is ideal for two-dimensional data.
However, OLAP contains multidimensional data, with data typically obtained from different and
unrelated sources. Employing a spreadsheet isn’t an optimum choice. The cube will
store and analyze multidimensional data in a logical and orderly manner. Now, data warehouses
are now offered as a fully built product that is configurable and capable of staging multiple
types of data. OLAP cubes are becoming outdated as OLAP cubes can’t deliver real-time
analysis and reporting, as businesses are now expecting something with high performance.

 UI Development: This is the fifth step in the development of the Data Warehouse. So far,
the processes discussed have taken place at the backend. A user interface is needed to govern
how the user and the computer system interact, in particular the use of input devices and
software, so that users can immediately access the data warehouse for analysis and for generating
reports. The main aim of a UI is to enable a user to effectively manage the system they are
interacting with. There are plenty of tools in the market that help with UI development; BI tools
such as Tableau, or Power BI for those using BigQuery, are great choices.

 Maintenance: This is the sixth step in the development of the Data Warehouse. In this phase, we
can update or make changes to the schema and to the data warehouse's application domain or
requirements. Data warehouse maintenance systems must also provide means to keep track of
schema modifications. At the schema level, we can perform insertion operations and change
dimensions and categories; such changes include, for example, adding or deleting user-defined
attributes.

 Test and Deployment: This is often the final step in the Data Warehouse development cycle.
Businesses and organizations test data warehouses to ensure that the required business
problems have been implemented successfully. Warehouse testing involves the scrutiny
of enormous volumes of data. Data that has to be compared comes from heterogeneous data
sources like relational databases, flat files, operational data, etc. The overall testing phases of a
data warehouse project include: data completeness, data transformation, data loading by means
of ETL tools, data integrity, etc. After testing the data warehouse, we deploy it so that users can
immediately access the data and perform analysis. Basically, in this phase, the data warehouse is turned
on and users can take advantage of it. By the time of deployment, most of its functions are
implemented. A data warehouse can be deployed in the organization's own data center or on the cloud.

2.5 Data Warehouse Delivery Process


The main steps used in the data warehouse delivery process are as follows:

IT Strategy: A data warehouse project must include an IT strategy for procuring and retaining funding.

Business Case Analysis: After the IT strategy has been designed, the next step is the business case. It is
essential to understand the level of investment that can be justified and to recognize the
projected business benefits which should be derived from using the data warehouse.

Education & Prototyping: The company will experiment with the ideas of data analysis and educate
itself on the value of the data warehouse. This is valuable and should be required if this is
the company's first exposure to the benefits of decision support systems. Prototyping can accelerate
this education and is better than building full working models up front. Prototyping requires business
requirements, a technical blueprint, and structures.

Business Requirements: These include:


 The logical model for data within the data warehouse.
 The source system that provides this data (mapping rules)
 The business rules to be applied to information.
 The query profiles for the immediate requirement

Technical Blueprint: It establishes the architecture of the warehouse. The technical blueprint of the
delivery process produces an architecture plan which satisfies long-term requirements. It lays out the
server and data mart architecture and the essential components of the database design.

Building the Vision: This is the phase where the first production deliverable is produced. This stage will
probably create significant infrastructure elements for extracting and loading information, but limits
them to the extraction and load of a few information sources.

History Load: The next step is one where the remainder of the required history is loaded into the data
warehouse. This means that no new entities are added to the data warehouse, but additional
physical tables would probably be created to store the increased data volumes.

AD-Hoc Query: In this step, we configure an ad-hoc query tool to operate against the data warehouse.
These end-customer access tools are capable of automatically generating the database query that answers
any question posed by the user.

Automation: The automation phase is where many of the operational management processes are fully
automated within the DWH. These would include:

 Extracting & loading the data from a variety of source systems
 Transforming the information into a form suitable for analysis
 Backing up, restoring & archiving data
 Generating aggregations from predefined definitions within the Data Warehouse
 Monitoring query profiles & determining the appropriate aggregates to maintain
system performance.

Extending Scope: In this phase, the scope of the data warehouse is extended to address a new set
of business requirements. This involves the loading of additional data sources into the data warehouse,
i.e., the introduction of new data marts.
Requirement Evolution: This is the last step of the delivery process of a data warehouse. As we
all know, requirements are not static and evolve continuously. As the business requirements
change, those changes must be reflected in the system.

2.6 Online Analytical Processing (OLAP)


OLAP stands for On-Line Analytical Processing. OLAP is a category of software technology
that enables analysts, managers, and executives to gain insight into data through fast,
consistent, interactive access to a wide variety of possible views of information that has been transformed
from raw data to reflect the real dimensionality of the enterprise as understood by the user.

OLAP implements multidimensional analysis of business information and supports the
capability for complex calculations, trend analysis, and sophisticated data modeling. It is rapidly
becoming the essential foundation for Intelligent Solutions including Business Performance
Management, Planning, Budgeting, Forecasting, Financial Reporting, Analysis, Simulation
Models, Knowledge Discovery, and Data Warehouse Reporting.
OLAP enables end-users to perform ad hoc analysis of data in multiple dimensions, providing the
insight and understanding they require for better decision making.

Who uses OLAP and Why?


OLAP applications are used by a variety of functions within an organization.
Finance and accounting:
o Budgeting
o Activity-based costing
o Financial performance analysis
o And financial modeling
Sales and Marketing
o Sales analysis and forecasting
o Market research analysis
o Promotion analysis
o Customer analysis
o Market and customer segmentation
Production
o Production planning
o Defect analysis
OLAP cubes have two main purposes. The first is to provide business users with a data model more
intuitive to them than a tabular model. This model is called a Dimensional Model.
The second purpose is to enable fast query response that is usually difficult to achieve using
tabular models.

How OLAP Works?


Fundamentally, OLAP has a very simple concept. It pre-calculates most of the queries that
are typically very hard to execute over tabular databases, namely aggregation, joining, and grouping.
These queries are calculated during a process usually called 'building' or 'processing' the OLAP
cube. This process typically happens overnight, so by the time end users get to work, the data will
have been updated.
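As a minimal sketch of this pre-calculation idea in Python with pandas (the data is invented for illustration; a real cube would aggregate every dimension combination):

import pandas as pd

sales = pd.DataFrame({
    "region":  ["East", "East", "West", "West"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "amount":  [100, 120, 80, 90],
})

# The expensive aggregation runs once, during the nightly cube "processing"...
cube = sales.groupby(["region", "quarter"])["amount"].sum()

# ...so an end-user query becomes a cheap lookup instead of a table scan.
print(cube.loc[("East", "Q2")])    # -> 120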

OLAP Guidelines (Dr.E.F.Codd Rule)


Dr E.F. Codd, the "father" of the relational model, has formulated a list of 12 guidelines
and requirements as the basis for selecting OLAP systems:

1) Multidimensional Conceptual View: This is the central feature of an OLAP system. By requiring a
multidimensional view, it is possible to carry out operations like slice and dice.
2) Transparency: Make the technology, underlying information repository, computing operations,
and the dissimilar nature of source data totally transparent to users. Such transparency helps to improve
the efficiency and productivity of the users.

3) Accessibility: It provides access only to the data that is actually required to perform the
particular analysis, presenting a single, coherent, and consistent view to the clients. The OLAP system
must map its own logical schema to the heterogeneous physical data stores and perform any necessary
transformations. The OLAP operations should sit between data sources (e.g., data warehouses) and
an OLAP front-end.

4) Consistent Reporting Performance: Ensure that users do not experience any significant
degradation in reporting performance as the number of dimensions or the size of the database
increases. That is, the performance of OLAP should not suffer as the number of dimensions is increased.
Users must observe consistent run time, response time, and machine utilization every time a given query
is run.

5) Client/Server Architecture: Make the server component of OLAP tools sufficiently intelligent that
various clients can be attached with a minimum of effort and integration programming. The
server should be capable of mapping and consolidating data between dissimilar databases.

6) Generic Dimensionality: An OLAP method should treat each dimension as equivalent in both
its structure and operational capabilities. Additional operational capabilities may be granted to
selected dimensions, but such additional functions should be grantable to any dimension.

7) Dynamic Sparse Matrix Handling: Adapt the physical schema to the specific analytical
model being created and loaded so that sparse matrix handling is optimized. When encountering a sparse
matrix, the system must be able to dynamically deduce the distribution of the information and adjust the
storage and access to obtain and maintain a consistent level of performance.

8) Multiuser Support: OLAP tools must provide concurrent data access, data integrity, and
access security.

9) Unrestricted Cross-dimensional Operations: The system should recognize dimensional
hierarchies and provide the ability to perform roll-up and drill-down operations within a dimension or
across dimensions.
10) Intuitive Data Manipulation: Consolidation-path reorientation (pivoting), drill-down and
roll-up, and other manipulations should be accomplished naturally and precisely via point-and-click
and drag-and-drop actions on the cells of the analytical model. This avoids the use of a menu or
multiple trips to the user interface.
11) Flexible Reporting: It gives business clients the ability to organize columns, rows, and
cells in a manner that facilitates simple manipulation, analysis, and synthesis of data.

12) Unlimited Dimensions and Aggregation Levels: The number of data dimensions should be
unlimited. Each of these common dimensions must allow a practically unlimited number of
user-defined aggregation levels within any given consolidation path.
Major advantages to using OLAP
 Business-focused calculations: One of the reasons OLAP systems are so fast is that they
pre-aggregate values that would otherwise have to be generated on the fly in a traditional
relational database system. The calculation engine is in charge of both data aggregation
and business computations. The analytic abilities of an OLAP system are independent of
how the data is portrayed; the analytic calculations are kept in the system's metadata rather than
in each report.
 Business-focused multidimensional data: To organize and analyze data, OLAP uses
a multidimensional technique. Data is arranged into dimensions in a multidimensional method,
with each dimension reflecting various aspects of the business. A dimension can be
defined as a characteristic or an attribute of a data set. Elements of each dimension share the
same common trait. Within the dimension, the elements are typically structured hierarchically.

 Trustworthy data and calculations: Data and calculations are centralized in OLAP
systems, guaranteeing that all end users have access to a single source of data. All data is
centralized in a multidimensional database in some OLAP systems. Several others centralize
some data in a multidimensional database and link to data stored relationally. Other OLAP
systems are integrated into a data warehouse and store data in multiple dimensions within the
database.

 Flexible, self-service reporting: Business users can query data and create reports with
OLAP systems using tools that are familiar to them.

 Speed-of-thought analysis: End-user queries are answered faster by OLAP systems than by
relational databases that do not use OLAP technology. OLAP systems pre-aggregate data,
allowing for fast response time.
OLAP queries are usually performed in a separate system, i.e., a data warehouse.
Transferring Data to Data Warehouse:
 Data warehouses aggregate data from a variety of sources.
 Data must be converted into a systematic format.
 In a typical data warehouse project, data integration takes up 80% of the effort.
Optimization of Data Warehouse:
 Data storage can be either relational or multi-dimensional.
 Additional data structures include sorting, indexing, summarizing, and cubes.
 Refreshing of data structures.
Querying Multidimensional data:
 SQL extensions.
 Map-reduce-based languages.
 Multidimensional Expressions (MDX).
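One way to sketch such multidimensional querying without an OLAP server is a pandas pivot table with margins, which mimics the subtotals a SQL GROUP BY ROLLUP/CUBE extension would produce. The data here is invented for illustration:

import pandas as pd

sales = pd.DataFrame({
    "region":  ["East", "East", "West", "West"],
    "product": ["Mobile", "Modem", "Mobile", "Modem"],
    "amount":  [100, 40, 80, 30],
})

# Cross-tabulate the two dimensions; margins add the roll-up totals.
report = sales.pivot_table(index="region", columns="product",
                           values="amount", aggfunc="sum",
                           margins=True, margins_name="Total")
print(report)
# product  Mobile  Modem  Total
# East        100     40    140
# West         80     30    110
# Total       180     70    250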

2.7 Characteristics of OLAP


 Multidimensional Data Analysis Techniques: Multidimensional views are inherently
representative of an actual business model. The key feature of OLAP tools is their capability for
multidimensional analysis, in which data is processed and viewed as part of a
multidimensional structure. This type of data analysis is particularly appealing to
business decision makers because they tend to view business data as data that is related
to other business data.
 Advanced Database Support: For efficient decision support, OLAP tools should have advanced
data access features, including:
 Access to many different kinds of DBMSs, flat files, and internal and external data sources.
 Access to aggregated data warehouse data as well as to the detail data found
in operational databases.
 Advanced data navigation features such as drill-down and roll-up.
 Rapid and consistent query response times.
 The ability to map end-user requests, expressed in either business or model terms, to the
appropriate data source and then to the proper data access language (typically SQL).
 Support for very large databases; as already described, a data warehouse
can easily and speedily grow to multiple gigabytes or even terabytes.
 Easy-to-Use End-User Interface: Advanced OLAP features become more useful when access
to them is kept simple. OLAP tools have equipped their sophisticated data extraction and
analysis tools with easy-to-use graphical interfaces. Many of the interface features are
"borrowed" from previous generations of data analysis tools that are already familiar to end
users. This familiarity makes OLAP easily accepted and readily used.
 Client/Server Architecture: Conforming the system to the principles of client/server architecture
provides a framework within which new systems can be designed, developed, and
implemented. The client/server environment enables an OLAP system to be divided into several
components that define its architecture. These components can then be placed on the same computer, or
distributed among several computers. Thus, OLAP is designed to meet ease-of-use
requirements while retaining system flexibility.

Motivations for using OLAP


 Understanding and improving sales: For organizations with a large number of products and a
variety of marketing channels, OLAP can assist in identifying the most suitable products and the
most popular channels. With certain strategies, it may also be possible to identify the most
profitable customers.
 Understanding and decreasing costs of doing business: One way to improve a business is to
increase sales; another is to analyze costs and limit them as much as possible without impacting
sales. The use of OLAP can improve the analysis of such costs. With specific methodologies, it
may also be possible to discover which expenditures provide a high return on investment
(ROI).

2.8 Online Transaction Processing (OLTP) Vs OLAP


OLTP System
An OLTP system deals with operational data, i.e., data involved in the operation of a
particular system. Examples include ATM transactions, bank transactions, etc.

OLAP System
An OLAP system deals with historical or archival data, i.e., data accumulated over a long
period. For example, if we collect the last 10 years of information about flight reservations, the data
can give us much meaningful information, such as trends in reservations. This may provide useful
insights like the peak time of travel and what kind of people are traveling in various
classes (Economy/Business).
The major difference between an OLTP and an OLAP system is the amount of data analyzed in a
single transaction. An OLTP system manages many concurrent users and queries touching only an
individual record or limited groups of records at a time, whereas an OLAP system must have the
capability to operate on millions of records to answer a single query.
Feature: Characteristic
OLTP: A system used to manage operational data.
OLAP: A system used to manage informational data.

Feature: Users
OLTP: Clerks, clients, and information technology professionals.
OLAP: Knowledge workers, including managers, executives, and analysts.

Feature: System orientation
OLTP: Customer-oriented; transaction and query processing are done by clerks, clients, and
information technology professionals.
OLAP: Market-oriented; data analysis is done by knowledge workers, including managers,
executives, and analysts.

Feature: Data contents
OLTP: Manages current data that, typically, are too detailed to be easily used for decision making.
OLAP: Manages large amounts of historical data, provides facilities for summarization and
aggregation, and stores and manages data at different levels of granularity. This makes the data
easier to use in informed decision making.

Feature: Database size
OLTP: 100 MB to GB.
OLAP: 100 GB to TB.

Feature: Database design
OLTP: Usually uses an entity-relationship (ER) data model and an application-oriented
database design.
OLAP: Typically uses either a star or snowflake model and a subject-oriented database design.

Feature: View
OLTP: Focuses primarily on the current data within an enterprise or department, without
referring to historical information or data in different organizations.
OLAP: Often spans multiple versions of a database schema, due to the evolutionary process of an
organization. OLAP systems also deal with data that originates from various organizations,
integrating information from many data stores.

Feature: Volume of data
OLTP: Not very large.
OLAP: Very large; because of their volume, OLAP data are stored on multiple storage media.

Feature: Access patterns
OLTP: Consists mainly of short, atomic transactions. Such a system requires concurrency control
and recovery techniques.
OLAP: Mostly read-only operations, since data warehouses store historical data.

Feature: Access mode
OLTP: Read/write.
OLAP: Mostly read.

Feature: Inserts and updates
OLTP: Short and fast inserts and updates initiated by end-users.
OLAP: Periodic long-running batch jobs refresh the data.

Feature: Number of records accessed
OLTP: Tens.
OLAP: Millions.

Feature: Normalization
OLTP: Fully normalized.
OLAP: Partially normalized.

Feature: Processing speed
OLTP: Very fast.
OLAP: Depends on the amount of data involved; batch data refreshes and complex queries may
take many hours. Query speed can be improved by creating indexes.

2.9 OLAP operations


In the multidimensional model, records are organized into various dimensions, and
each dimension includes multiple levels of abstraction described by concept hierarchies. This
organization provides users with the flexibility to view data from various perspectives. A number of
OLAP data cube operations exist to materialize these different views, allowing interactive querying and
analysis of the data at hand. Hence, OLAP supports a user-friendly environment for interactive data
analysis.

Consider the OLAP operations to be performed on multidimensional data. The figure
shows data cubes for the sales of a shop. The cube contains three dimensions: location, time, and
item, where location is aggregated with respect to city values, time with respect to
quarters, and item with respect to item types.

1. Roll-Up
The roll-up operation (also known as drill-up or aggregation) performs aggregation on
a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction.
Roll-up is like zooming out on the data cube. The figure shows the result of a roll-up operation
performed on the dimension location.
The hierarchy for location is defined as the order: street < city < province or state < country. The
roll-up operation aggregates the data by ascending the location hierarchy from the level of city to the
level of country.
When a roll-up is performed by dimension reduction, one or more dimensions are removed from
the cube. For example, consider a sales data cube having two dimensions, location and time. Roll-up
may be performed by removing the time dimension, resulting in an aggregation of the total sales
by location, rather than by location and by time.
Example
Consider the following cubes illustrating temperature of certain days recorded weekly:

Temperature 64 65 68 69 70 71 72 75 80 81 83 85

Week1 1 0 1 0 1 0 0 0 0 0 1 0

Week2 0 0 0 1 0 0 1 2 0 1 0 0

Consider that we want to set up levels (hot (80-85), mild (70-75), cool (64-69)) in temperature from the
above cube.
To do this, we have to group the columns and add up the values according to the concept hierarchy.
This operation is known as a roll-up.
By doing this, we obtain the following cube:

Temperature cool mild hot

Week1 2 1 1

Week2 1 3 1

The roll-up operation groups the information by levels of temperature.


The following diagram illustrates how roll-up works.
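The same roll-up can be sketched in Python with pandas, mapping each temperature column onto its band in the concept hierarchy and summing (the data is the weekly cube above):

import pandas as pd

counts = pd.DataFrame(
    [[1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0],
     [0, 0, 0, 1, 0, 0, 1, 2, 0, 1, 0, 0]],
    index=["Week1", "Week2"],
    columns=[64, 65, 68, 69, 70, 71, 72, 75, 80, 81, 83, 85],
)

def band(t):
    """The concept hierarchy: temperature -> cool/mild/hot."""
    return "cool" if t <= 69 else ("mild" if t <= 75 else "hot")

rolled = counts.T.groupby(band).sum().T   # aggregate the columns by band
print(rolled[["cool", "mild", "hot"]])
# Week1: cool 2, mild 1, hot 1;  Week2: cool 1, mild 3, hot 1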

2. Drill-Down
The drill-down operation (also called roll-down) is the reverse of roll-up. Drill-down is
like zooming in on the data cube. It navigates from less detailed data to more detailed data. Drill-
down can be performed by either stepping down a concept hierarchy for a dimension or by adding
additional dimensions.
The figure shows a drill-down operation performed on the dimension time by stepping down a
concept hierarchy defined as day < month < quarter < year. Drill-down occurs by descending
the time hierarchy from the level of quarter to the more detailed level of month.
Because a drill-down adds more detail to the given data, it can also be performed by adding a
new dimension to the cube. For example, a drill-down on the central cube of the figure can
occur by introducing an additional dimension, such as customer group.
Example
Drill-down adds more details to the given data

Temperature cool mild hot

Day 1 0 0 0

Day 2 0 0 0

Day 3 0 0 1

Day 4 0 1 0

Day 5 1 0 0

Day 6 0 0 0

Day 7 1 0 0

Day 8 0 0 0

Day 9 1 0 0

Day 10 0 1 0

Day 11 0 1 0

Day 12 0 1 0

Day 13 0 0 1

Day 14 0 0 0

The following diagram illustrates how Drill-down works.


3. Slice
A slice is a subset of the cube corresponding to a single value for one or more members of a
dimension. For example, a slice operation is executed when the user wants a selection on
one dimension of a three-dimensional cube, resulting in a two-dimensional slice. So, the slice
operation performs a selection on one dimension of the given cube, thus resulting in a subcube.
For example, if we make the selection temperature = cool, we will obtain the following cube:

Temperature cool

Day 1 0

Day 2 0

Day 3 0

Day 4 0

Day 5 1

Day 6 0

Day 7 1

Day 8 0

Day 9 1

Day 10 0

Day 11 0

Day 12 0

Day 13 0

Day 14 0
The following diagram illustrates how Slice works.

Here the slice is performed on the dimension "time" using the criterion time = "Q1".
It forms a new sub-cube by selecting one or more dimensions.
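In pandas terms, a slice fixes one dimension value so that dimension drops out. The day-level cube below follows the drill-down table above:

import pandas as pd

cube = pd.DataFrame(
    {"cool": [0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0],
     "mild": [0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0],
     "hot":  [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]},
    index=[f"Day {i}" for i in range(1, 15)],
)

cool_slice = cube["cool"]    # fix temperature = "cool"; one dimension drops out
print(cool_slice)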
4. Dice
The dice operation defines a subcube by performing a selection on two or more dimensions.
For example, applying the selection (time = day 3 OR time = day 4) AND (temperature = cool OR
temperature = hot) to the original cube, we get the following subcube (still two-dimensional):

Temperature cool hot

Day 3 0 1

Day 4 0 0
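The same dice can be sketched in pandas as a selection on both dimensions at once (the cube here repeats a fragment of the day-level data used above):

import pandas as pd

cube = pd.DataFrame(
    {"cool": [0, 0, 0, 0], "mild": [0, 0, 0, 1], "hot": [0, 0, 1, 0]},
    index=["Day 1", "Day 2", "Day 3", "Day 4"],
)

dice = cube.loc[["Day 3", "Day 4"], ["cool", "hot"]]   # two selections at once
print(dice)
#        cool  hot
# Day 3     0    1
# Day 4     0    0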

Consider the following diagram, which shows the dice operations.


The dice operation on the cubes based on the following selection criteria involves three dimensions.
o (location = "Toronto" or "Vancouver")
o (time = "Q1" or "Q2")
o (item =" Mobile" or "Modem")
5. Pivot
The pivot operation is also called rotation. Pivot is a visualization operation that rotates the
data axes in view to provide an alternative presentation of the data. It may involve swapping the rows
and columns, or moving one of the row dimensions into the column dimensions.

Consider the following diagram, which shows the pivot operation.
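In code, the same rotation is simply a transpose of the result cube; the sketch below rotates the roll-up result from earlier so that temperature levels become rows and weeks become columns:

import pandas as pd

rolled = pd.DataFrame(
    {"cool": [2, 1], "mild": [1, 3], "hot": [1, 1]},
    index=["Week1", "Week2"],
)

print(rolled.T)   # rotate: same data, alternative presentation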

Other OLAP Operations


The drill-across operation executes queries involving more than one fact table. The drill-through
operation makes use of relational SQL facilities to drill through the bottom level of a data cube down to
its back-end relational tables.
Other OLAP operations may include ranking the top-N or bottom-N elements in lists, as well as
computing moving averages, growth rates, interest, internal rates of return, depreciation,
currency conversions, and statistical functions.
OLAP offers analytical modeling capabilities, including a calculation engine for
deriving ratios, variance, etc., and for computing measures across various dimensions. It
can generate summarization, aggregation, and hierarchies at each granularity level and at
every dimension intersection. OLAP also provides functional models for forecasting, trend analysis, and
statistical analysis. In this context, the OLAP engine is a powerful data analysis tool.

2.10 Types of OLAP

There are three main types of OLAP servers, as follows:

ROLAP stands for Relational OLAP, an application based on relational DBMSs.

MOLAP stands for Multidimensional OLAP, an application based on multidimensional DBMSs.

HOLAP stands for Hybrid OLAP, an application using both relational and multidimensional techniques.
Relational OLAP (ROLAP) Server

These are intermediate servers which stand between a relational back-end server and
client front-end tools.
They use a relational or extended-relational DBMS to save and handle warehouse data, and OLAP
middleware to provide missing pieces.
ROLAP servers contain optimization for each DBMS back end, implementation of
aggregation navigation logic, and additional tools and services.
ROLAP technology tends to have higher scalability than MOLAP technology.

ROLAP systems work primarily from the data that resides in a relational database, where the base
data and dimension tables are stored as relational tables. This model permits the multidimensional
analysis of data.

This technique relies on manipulating the data stored in the relational database to give the
appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and
dicing is equivalent to adding a "WHERE" clause to the SQL statement.
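The sketch below makes this concrete with Python and SQLite; the fact table, its columns, and the data are illustrative assumptions:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_sales (region TEXT, quarter TEXT, amount INT);
    INSERT INTO fact_sales VALUES
        ('East','Q1',100), ('East','Q2',120),
        ('West','Q1',80),  ('West','Q2',90);
""")

-- the dice (region = 'East') AND (quarter = 'Q1' OR 'Q2') becomes a WHERE clause:
rows = conn.execute("""
    SELECT quarter, SUM(amount)
    FROM fact_sales
    WHERE region = 'East' AND quarter IN ('Q1', 'Q2')
    GROUP BY quarter
""").fetchall()
print(rows)   # [('Q1', 100), ('Q2', 120)]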

Relational OLAP Architecture

ROLAP Architecture includes the following components

o Database server.

o ROLAP server.

o Front-end tool.
Relational OLAP (ROLAP) is the latest and fastest-growing OLAP technology segment in the
market. This method allows multiple multidimensional views of two-dimensional relational tables to be
created, avoiding the need to structure the data around the desired view.

Some products in this segment have supported strong SQL engines to handle the complexity
of multidimensional analysis. This includes creating multiple SQL statements to handle user requests,
being 'RDBMS'-aware, and being capable of generating the SQL statements based on the optimizer of the
DBMS engine.

Advantages

Can handle large amounts of information: The data size limitation of ROLAP technology depends
on the data size of the underlying RDBMS; ROLAP itself does not restrict the amount of data.

Leverages RDBMS features: An RDBMS already comes with many features, so ROLAP technologies,
which work on top of the RDBMS, can take advantage of these functionalities.

Disadvantages

Performance can be slow: Because each ROLAP report is a SQL query (or multiple SQL queries) against
the relational database, query time can be prolonged if the underlying data size is large.

Limited by SQL functionalities: ROLAP technology relies upon developing SQL statements
to query the relational database, and SQL statements do not suit all needs.

Multidimensional OLAP (MOLAP) Server

A MOLAP system is based on a native logical model that directly supports multidimensional data
and operations. Data are stored physically into multidimensional arrays, and positional techniques
are used to access them.

One of the significant distinctions of MOLAP against ROLAP is that MOLAP data are summarized and
stored in an optimized format in a multidimensional cube, instead of in a relational database.
In the MOLAP model, data is structured into proprietary formats according to the client's reporting
requirements, with the calculations pre-generated on the cubes.

MOLAP Architecture

MOLAP Architecture includes the following components

o Database server.
o MOLAP server.

o Front-end tool.

MOLAP structure primarily reads the precompiled data. MOLAP structure has limited
capabilities to dynamically create aggregations or to evaluate results which have not been pre-calculated
and stored.

Applications requiring iterative and comprehensive time-series analysis of trends are well suited
for MOLAP technology (e.g., financial analysis and budgeting).

Examples include Arbor Software's Essbase, Oracle's Express Server, Pilot Software's Lightship
Server, Sinper's TM/1, Planning Science's Gentium, and Kenan Technology's Multiway.

Some of the problems faced by clients are related to maintaining support for multiple subject areas
in an RDBMS. Some vendors solve these problems by providing access from MOLAP tools
to detailed data in an RDBMS.

This can be very useful for organizations with performance-sensitive multidimensional analysis
requirements and that have built or are in the process of building a data warehouse architecture
that contains multiple subject areas.
An example would be the creation of sales data measured by several dimensions (e.g., product and sales
region) to be stored and maintained in a persistent structure. This structure would be provided to reduce
the application overhead of performing calculations and building aggregation during initialization. These
structures can be automatically refreshed at predetermined intervals established by an administrator.

Advantages
 Excellent Performance: A MOLAP cube is built for fast information retrieval, and is optimal for
slicing and dicing operations.
 Can perform complex calculations: All calculations have been pre-generated when the cube
is created. Hence, complex calculations are not only possible, but they return quickly.

Disadvantages
 Limited in the amount of information it can handle: Because all calculations are
performed when the cube is built, it is not possible to contain a large amount of data in the cube
itself.
 Requires additional investment: Cube technology is generally proprietary and does not already
exist in the organization. Therefore, to adopt MOLAP technology, chances are other investments
in human and capital resources are needed.

Hybrid OLAP (HOLAP) Server

HOLAP incorporates the best features of MOLAP and ROLAP into a single
architecture. HOLAP systems store the more substantial quantities of detailed data in relational
tables, while the aggregations are stored in pre-calculated cubes. HOLAP can also drill through from
the cube down to the relational tables for detailed data. Microsoft SQL Server 2000
provides a hybrid OLAP server.
Advantages of HOLAP

1. HOLAP provide benefits of both MOLAP and ROLAP.

2. It provides fast access at all levels of aggregation.

3. HOLAP balances the disk space requirement, as it only stores the aggregate information on the
OLAP server while the detail data remains in the relational database. So no duplicate copy of the
detail data is maintained.

Disadvantages of HOLAP

1. HOLAP architecture is very complicated because it supports both MOLAP and ROLAP servers.

Other Types

There are also less popular OLAP styles that one may stumble upon every so
often. Some of the less common variants in the OLAP industry are listed below.

Web-Enabled OLAP (WOLAP) Server

WOLAP refers to an OLAP application that is accessible via a web browser. Unlike traditional
client/server OLAP applications, WOLAP is considered to have a three-tiered architecture consisting
of three components: a client, a middleware, and a database server.

Desktop OLAP (DOLAP) Server

DOLAP permits a user to download a section of the data from the database or source, and work
with that dataset locally, or on their desktop.

Mobile OLAP (MOLAP) Server

Mobile OLAP enables users to access and work on OLAP data and applications remotely through
the use of their mobile devices.

Spatial OLAP (SOLAP) Server

SOLAP includes the capabilities of both Geographic Information Systems (GIS) and OLAP into a
single user interface. It facilitates the management of both spatial and non-spatial data.
2.11 ROLAP Vs MOLAP Vs HOLAP

Full form:
ROLAP stands for Relational Online Analytical Processing. MOLAP stands for Multidimensional
Online Analytical Processing. HOLAP stands for Hybrid Online Analytical Processing.

Storage mode:
The ROLAP storage mode causes the aggregations of the partition to be stored in indexed views in
the relational database that was specified in the partition's data source. The MOLAP storage mode
causes the aggregations of the partition, and a copy of its source data, to be stored in a
multidimensional structure in Analysis Services when the partition is processed. The HOLAP
storage mode combines attributes of both MOLAP and ROLAP: like MOLAP, HOLAP causes the
aggregations of the partition to be stored in a multidimensional structure in a SQL Server Analysis
Services instance.

Copy of source data:
ROLAP does not cause a copy of the source data to be stored in the Analysis Services data
folders; instead, when results cannot be derived from the query cache, the indexed views in the data
source are accessed to answer queries. With MOLAP, because a copy of the source data resides in
the multidimensional structure, queries can be resolved without accessing the partition's source
data; this structure is highly optimized to maximize query performance and can reside on the
computer where the partition is defined or on another computer running Analysis Services. HOLAP
does not cause a copy of the source data to be stored; for queries that access only summary data in
the aggregations of a partition, HOLAP is the equivalent of MOLAP.

Query performance:
Query response is frequently slower with ROLAP storage than with the MOLAP or HOLAP storage
modes, and processing time is also frequently slower with ROLAP. With MOLAP, query response
times can be reduced substantially by using aggregations; the data in the partition's MOLAP
structure is only as current as the most recent processing of the partition. With HOLAP, queries that
access source data (for example, drilling down to an atomic cube cell for which there is no
aggregation data) must retrieve data from the relational database and will not be as fast as they
would be if the source data were stored in the MOLAP structure.
Difference between ROLAP and MOLAP

 ROLAP stands for Relational Online Analytical Processing; MOLAP stands for Multidimensional
Online Analytical Processing.
 ROLAP is usually used when the data warehouse contains relational data; MOLAP is used when
the data warehouse contains relational as well as non-relational data.
 ROLAP contains an analytical server; MOLAP contains an MDDB (multidimensional database)
server.
 ROLAP creates a multidimensional view of the data dynamically; MOLAP contains prefabricated
data cubes.
 ROLAP is very easy to implement; MOLAP is difficult to implement.
 ROLAP has a high response time; MOLAP has a lower response time due to its prefabricated
cubes.
 ROLAP requires a small amount of memory; MOLAP requires a large amount of memory.


UNIT III META DATA, DATA MART AND PARTITION STRATEGY
Meta Data – Categories of Metadata – Role of Metadata – Metadata Repository – Challenges for Meta
Management - Data Mart – Need of Data Mart- Cost Effective Data Mart- Designing Data Marts- Cost of
Data Marts- Partitioning Strategy – Vertical partition – Normalization – Row Splitting –
Horizontal Partition

3.1 Meta Data:

Metadata is simply defined as data about data. The data that is used to represent other data is known as
metadata. For example, the index of a book serves as a metadata for the contents in the book. In other
words, we can say that metadata is the summarized data that leads us to detailed data. In terms of data
warehouse, we can define metadata as follows.

 Metadata is the road-map to a data warehouse.

 Metadata in a data warehouse defines the warehouse objects.

 Metadata acts as a directory. This directory helps the decision support system to locate
the contents of a data warehouse.

Note − In a data warehouse, we create metadata for the data names and definitions of a given
data warehouse. Along with this, additional metadata is also created for time-stamping any
extracted data and for recording the source of the extracted data.

Metadata can be stored in various forms, such as text, XML, or RDF, and can be organized
using metadata standards and schemas. There are many metadata standards that have been
developed to facilitate the creation and management of metadata, such as Dublin Core, schema.org, and
the Metadata Encoding and Transmission Standard (METS). Metadata schemas define the structure
and format of metadata and provide a consistent framework for organizing and describing data.

Several examples of metadata are:

1. A library catalog may be considered metadata. The catalog metadata consists of several
predefined components representing specific attributes of a resource, and each component can have
one or more values. These components could be the name of the author, the name of the document,
the publisher's name, the publication date, and the subjects to which it belongs.

2. The table of contents and the index in a book may be treated as metadata for the book.

3. Suppose we say that a data item about a person is 80. This must be defined by noting that it is the
person's weight and the unit is kilograms. Therefore, (weight, kilograms) is the metadata about the
data value 80.

4. Another example of metadata is data about the tables and figures in a document like this book. A
table has a name (e.g., the table title), and the column names of the table may be treated as
metadata. The figures also have titles or names.
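A toy sketch in Python of example 3 above — the raw value is meaningless until the metadata names the attribute and its unit (the source field is an invented illustration):

data = 80                                    # the raw data value by itself
metadata = {"attribute": "weight",           # what the value measures
            "unit": "kilograms",             # how it is measured
            "source": "registration_form"}   # where it came from (illustrative)
print(f"{data} {metadata['unit']} ({metadata['attribute']})")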

3.2 Categories of Metadata

Metadata can be broadly categorized into three categories −


 Business Metadata − It has the data ownership information, business definition, and
changing
policies.

 Technical Metadata − It includes database system names, table and column names and sizes, data
types and allowed values. Technical metadata also includes structural information such as primary
and foreign key attributes and indices.

 Operational Metadata − It includes currency of data and data lineage. Currency of data
means whether the data is active, archived, or purged. Lineage of data means the history of data
migrated and transformation applied on it.

Types of Metadata

Metadata in a data warehouse fall into three major parts:

o Operational Metadata

o Extraction and Transformation Metadata

o End-User Metadata

Operational Metadata
As we know, data for the data warehouse comes from various operational systems of the
enterprise. These source systems include different data structures, and the data elements selected for the
data warehouse have various field lengths and data types.
In selecting data from the source systems for the data warehouse, we split records, combine
parts of records from different source files, and deal with multiple coding schemes and field lengths.
When we deliver information to the end-users, we must be able to tie it back to the source data sets.
Operational metadata contains all of this information about the operational data sources.

Extraction and Transformation Metadata


Extraction and transformation metadata include data about the removal of data from the
source systems, namely, the extraction frequencies, extraction methods, and business rules for
the data extraction. Also, this category of metadata contains information about all the data
transformation that takes place in the data staging area.
End-User Metadata
The end-user metadata is the navigational map of the data warehouses. It enables the end-users to
find data from the data warehouses. The end-user metadata allows the end-users to use their
business terminology and look for the information in those ways in which they usually think of the
business.

3.3 Role of Metadata

Metadata has a very important role in a data warehouse. The role of metadata in a
warehouse is different from the warehouse data, yet it plays an important role. The various roles
of metadata are explained below.
 Metadata acts as a directory.

 This directory helps the decision support system to locate the contents of the data warehouse.

 Metadata helps in decision support system for mapping of data when data is transformed
from operational environment to data warehouse environment.

 Metadata helps in summarization between current detailed data and highly summarized data.

 Metadata also helps in summarization between lightly detailed data and highly summarized data.

 Metadata is used for query tools.

 Metadata is used in extraction and cleansing tools.

 Metadata is used in reporting tools.

 Metadata is used in transformation tools.

 Metadata plays an important role in loading functions.

The following diagram shows the roles of metadata.


3.4 Metadata Repository

Metadata repository is an integral part of a data warehouse system. It has the following metadata −

 Definition of data warehouse − It includes the description of structure of data warehouse. The
description is defined by schema, view, hierarchies, derived data definitions, and data mart
locations and contents.

 Business metadata − It contains the data ownership information, business definition,
and changing policies.

 Operational Metadata − It includes currency of data and data lineage. Currency of data means
whether the data is active, archived, or purged. Lineage of data means the history of data migrated
and transformation applied on it.

 Data for mapping from operational environment to data warehouse − It includes the source
databases and their contents, data extraction, data partition cleaning, transformation rules,
data refresh and purging rules.

 Algorithms for summarization − It includes dimension algorithms, data on granularity,


aggregation, summarizing, etc.

Benefits of Metadata Repository

A metadata repository is a centralized database or system that is used to store and manage
metadata. Some of the benefits of using a metadata repository include:

1. Improved data quality: A metadata repository can help ensure that metadata is
consistently structured and accurate, which can improve the overall quality of the data.

2. Increased data accessibility: A metadata repository can make it easier for users to access
and understand the data, by providing context and information about the data.

3. Enhanced data integration: A metadata repository can facilitate data integration by providing a
common place to store and manage metadata from multiple sources.

4. Improved data governance: A metadata repository can help enforce metadata standards
and policies, making it easier to ensure that data is being used and managed appropriately.

5. Enhanced data security: A metadata repository can help protect the privacy and security
of metadata, by providing controls to restrict access to sensitive or confidential information.

Metadata repositories can provide many benefits in terms of improving the quality, accessibility,
and management of data.

Further Benefits of Metadata Repository

1. It provides a set of tools for enterprise-wide metadata management.

2. It eliminates and reduces inconsistency, redundancy, and underutilization.

3. It improves organization control, simplifies management, and accounting of information assets.

4. It increases coordination, understanding, identification, and utilization of information assets.


5. It enforces CASE development standards with the ability to share and reuse metadata.

6. It leverages investment in legacy systems and utilizes existing applications.

7. It provides a relational model for heterogeneous RDBMS to share information.

8. It gives useful data administration tool to manage corporate information assets with the
data dictionary.

9. It increases reliability, control, and flexibility of the application development process.

3.5 Challenges for Meta Management:

There are several challenges that can arise when managing metadata:

1. Lack of standardization: Different organizations or systems may use different standards or
conventions for metadata, which can make it difficult to effectively manage metadata
across different sources.

2. Data quality: Poorly structured or incorrect metadata can lead to problems with data
quality, making it more difficult to use and understand the data.

3. Data integration: When integrating data from multiple sources, it can be challenging to ensure
that the metadata is consistent and aligned across the different sources.

4. Data governance: Establishing and enforcing metadata standards and policies can be
difficult, especially in large organizations with multiple stakeholders.

5. Data security: Ensuring the security and privacy of metadata can be a challenge, especially when
working with sensitive or confidential information.

3.6 Data Mart

A Data Mart is a subset of an organizational information store, generally oriented to a specific purpose
or primary data subject, which may be distributed to support business needs. Data marts are
analytical data stores designed to focus on particular business functions for a specific
community within an organization. Data marts are usually derived from subsets of data in a data
warehouse, though in the bottom-up data warehouse design methodology, the data warehouse is created
from the union of organizational data marts.

The fundamental use of a data mart is for Business Intelligence (BI) applications. BI is used to gather,
store, access, and analyze records. A data mart can be used by smaller businesses to utilize the data they
have accumulated, since it is less expensive than implementing a data warehouse.
Reasons for creating a data mart

o Creates collective data by a group of users

o Easy access to frequently needed data

o Ease of creation

o Improves end-user response time

o Lower cost than implementing a complete data warehouse

o Potential clients are more clearly defined than in a comprehensive data warehouse

o It contains only essential business data and is less cluttered.

Types of Data Marts

There are mainly two approaches to designing data marts. These approaches are

o Dependent Data Marts

o Independent Data Marts

Dependent Data Marts

A dependent data mart is a logical subset or a physical subset of a larger data warehouse. In this
technique, the data marts are treated as subsets of a data warehouse: first a data warehouse is created,
from which various data marts can then be created. These data marts are dependent on the data
warehouse and extract the essential records from it. Since the data warehouse creates the data marts,
there is no need for data mart integration. This is also known as a top-down approach.
Independent Data Marts

The second approach is independent data marts (IDM). Here, independent data marts are created first,
and then a data warehouse is designed using these multiple independent data marts. In this approach, as
all the data marts are designed independently, the integration of data marts is required. It is also
termed a bottom-up approach, as the data marts are integrated to develop a data warehouse.

Other than these two categories, one more type exists that is called "Hybrid Data Marts."

Hybrid Data Marts

It allows us to combine input from sources other than a data warehouse. This could be helpful in many
situations, especially when ad hoc integrations are needed, such as after a new group or product is added
to the organization.

Need Of Data Mart:

1. A data mart focuses only on the functioning of a particular department of an organization.

2. It is maintained by a single authority of an organization.

3. Since it stores the data related to a specific part of an organization, data retrieval from it is very
quick.
4. Designing and maintaining a data mart is quite easy compared to a data warehouse.

5. It reduces user response time because it stores a small volume of data.

6. It is small in size, so accessing data from it is very fast.

7. This storage unit is used by most organizations for the smooth running of their departments.

Advantages of Data Mart:

1. Implementation of a data mart needs less time compared to the implementation of a data
warehouse, as a data mart is designed for a particular department of an organization.

2. Organizations are provided with choices of data mart models depending upon cost and
their business.

3. Data can be easily accessed from a data mart.

4. It contains frequently accessed queries, enabling analysis of business trends.

Disadvantages of Data Mart:

1. Since it stores data related only to a specific function, it does not store the huge volume of data
related to each and every department of an organization, as a data warehouse does.

2. Creating too many data marts sometimes becomes cumbersome.

Features of data marts:

Subset of Data: Data marts are designed to store a subset of data from a larger data warehouse or data
lake. This allows for faster query performance since the data in the data mart is focused on a
specific business unit or department.

Optimized for Query Performance: Data marts are optimized for query performance, which means that
they are designed to support fast queries and analysis of the data stored in the data mart.

Customizable: Data marts are customizable, which means that they can be designed to meet the specific
needs of a business unit or department.

Self-Contained: Data marts are self-contained, which means that they have their own set of
tables, indexes, and data models. This allows for easier management and maintenance of the data mart.

Security: Data marts can be secured, which means that access to the data in the data mart can
be controlled and restricted to specific users or groups.

Scalability: Data marts can be scaled horizontally or vertically to accommodate larger volumes of data or
to support more users.

Integration with Business Intelligence Tools: Data marts can be integrated with business intelligence
tools, such as Tableau, Power BI, or QlikView, which allows users to analyze and visualize the
data stored in the data mart.
ETL Process: Data marts are typically populated using an Extract, Transform, Load (ETL)
process, which means that data is extracted from the larger data warehouse or data lake, transformed to
meet the requirements of the data mart, and loaded into the data
mart.

3.7 Why Do We Need a Data Mart?

Listed below are the reasons to create a data mart −

 To partition data in order to impose access control strategies.

 To speed up the queries by reducing the volume of data to be scanned.

 To segment data into different hardware platforms.

 To structure data in a form suitable for a user access tool.

Note − Do not create a data mart for any other reason, since the operational cost of data marting can be
very high. Before data marting, make sure that the data marting strategy is appropriate for your particular
solution.

3.8 Cost-effective Data Marting

Follow the steps given below to make data marting cost-effective −

 Identify the Functional Splits

 Identify User Access Tool Requirements

 Identify Access Control Issues

Identify the Functional Splits

In this step, we determine whether the organization has natural functional splits. We look for
departmental splits, and we determine whether the way in which departments use information tends to be
in isolation from the rest of the organization. Let's have an example.

Consider a retail organization, where each merchant is accountable for maximizing the sales of a group of
products. For this, the following information is valuable −

 sales transaction on a daily basis

 sales forecast on a weekly basis

 stock position on a daily basis

 stock movements on a daily basis

As the merchant is not interested in the products they are not dealing with, the data mart is a subset of
the data dealing with the product group of interest. The following diagram shows data marting
for different users.
Given below are the issues to be taken into account while determining the functional split −

 The structure of the department may change.

 The products might switch from one department to another.

 The merchant could query the sales trend of other products to analyze what is happening to the
sales.

Note − We need to determine the business benefits and technical feasibility of using a data mart.

Identify User Access Tool Requirements

We need data marts to support user access tools that require internal data structures. The data in
such structures is outside the control of the data warehouse but needs to be populated and updated on a
regular basis.

There are some tools that can populate directly from the source system, but some cannot. Therefore,
additional requirements outside the scope of the tool need to be identified for the future.

Note − In order to ensure consistency of data across all access tools, the data should not be
directly
populated from the data warehouse, rather each tool must have its own data mart.

Identify Access Control Issues

There should be privacy rules to ensure the data is accessed by authorized users only. For example, a
data warehouse for a retail banking institution ensures that all the accounts belong to the same legal
entity. Privacy laws can force you to totally prevent access to information that is not owned by the
specific bank.

Data marts allow us to build a complete wall by physically separating data segments within the
data warehouse. To avoid possible privacy problems, the detailed data can be removed from the
data warehouse. We can create a data mart for each legal entity and load it via the data warehouse, with
detailed account data.
3.9 Designing Data Marts

Data marts should be designed as smaller versions of the starflake schema within the data warehouse and
should match the database design of the data warehouse. This helps in maintaining control
over database instances.

The summaries are data marted in the same way as they would have been designed within the
data warehouse. Summary tables help to utilize all dimension data in the starflake schema.

The significant steps in implementing a data mart are to design the schema, construct the
physical storage, populate the data mart with data from source systems, access it to make informed
decisions and manage it over time. So, the steps are:

Designing

The design step is the first in the data mart process. This phase covers all of the functions from initiating
the request for a data mart through gathering data about the requirements and developing the logical and
physical design of the data mart.

It involves the following tasks:

1. Gathering the business and technical requirements

2. Identifying data sources

3. Selecting the appropriate subset of data

4. Designing the logical and physical architecture of the data mart.

Constructing

This step contains creating the physical database and logical structures associated with the data mart to
provide fast and efficient access to the data.

It involves the following tasks:

1. Creating the physical database and logical structures such as tablespaces associated with the data
mart.
2. Creating the schema objects, such as tables and indexes, described in the design step.

3. Determining how best to set up the tables and access structures.

Populating

This step includes all of the tasks related to getting data from the source, cleaning it up, modifying it
to the right format and level of detail, and moving it into the data mart; a minimal code sketch follows the task list below.

It involves the following tasks:

1. Mapping data sources to target data sources

2. Extracting data

3. Cleansing and transforming the information.

4. Loading data into the data mart

5. Creating and storing metadata
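
To make these tasks concrete, here is a minimal Python sketch of a populate step. The table layout, the cleansing rule, and the date format are illustrative assumptions, not the prescribed method of any particular tool.

import sqlite3
from datetime import datetime

def populate_sales_mart(source_rows, mart_path="sales_mart.db"):
    """Cleanse, transform, and load extracted rows into a hypothetical sales data mart."""
    conn = sqlite3.connect(mart_path)
    conn.execute("CREATE TABLE IF NOT EXISTS sales ("
                 "product_id INTEGER, qty INTEGER, value REAL, "
                 "sales_date TEXT, store_id INTEGER)")
    conn.execute("CREATE TABLE IF NOT EXISTS load_metadata ("
                 "loaded_at TEXT, row_count INTEGER)")
    loaded = 0
    for row in source_rows:                         # rows already extracted from the source
        if row.get("qty") is None:                  # cleanse: skip incomplete records
            continue
        iso_date = datetime.strptime(               # transform: normalize the date format
            row["sales_date"], "%d-%b-%y").strftime("%Y-%m-%d")
        conn.execute("INSERT INTO sales VALUES (?, ?, ?, ?, ?)",
                     (row["product_id"], row["qty"], row["value"],
                      iso_date, row["store_id"]))
        loaded += 1
    conn.execute("INSERT INTO load_metadata VALUES (?, ?)",   # create and store metadata
                 (datetime.now().isoformat(), loaded))
    conn.commit()
    conn.close()

populate_sales_mart([{"product_id": 30, "qty": 5, "value": 3.67,
                      "sales_date": "3-Aug-13", "store_id": 16}])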

Accessing

This step involves putting the data to use: querying the data, analyzing it, creating reports, charts
and graphs and publishing them.

It involves the following tasks:

1. Set up an intermediate layer (meta layer) for the front-end tool to use. This layer
translates database operations and object names into business terms so that end users
can interact with the data mart using words that relate to the business functions.

2. Set up and manage database structures, like summarized tables, that help queries submitted
through the front-end tools execute rapidly and efficiently.

Managing

This step involves managing the data mart over its lifetime. In this step, management functions
are performed such as:

1. Providing secure access to the data.

2. Managing the growth of the data.

3. Optimizing the system for better performance.

4. Ensuring the availability of data even with system failures.

3.10 Cost of Data Marting

The cost measures for data marting are as follows −

 Hardware and Software Cost


 Network Access

 Time Window Constraints

Hardware and Software Cost

Although data marts are created on the same hardware, they require some additional hardware
and software. To handle user queries, additional processing power and disk storage are required. If detailed
data and the data mart exist within the data warehouse, then we would face additional cost to
store and manage the replicated data.

Note − Data marting is more expensive than aggregations, therefore it should be used as an
additional strategy and not as an alternative strategy.

Network Access
A data mart could be at a different location from the data warehouse, so we should ensure that the LAN
or WAN has the capacity to handle the data volumes being transferred within the data mart
load process.

Time Window Constraints


The extent to which a data mart loading process will eat into the available time window depends on the
complexity of the transformations and the data volumes being shipped. The determination of how many
data marts are possible depends on −

 Network capacity.

 Time window available

 Volume of data being transferred

 Mechanisms being used to insert data into a data mart

3.11 Partitioning Strategy

Using data partitioning techniques, a huge dataset can be divided into smaller, simpler sections. A few
applications for these techniques include parallel computing, distributed systems, and
database administration. Data partitioning aims to improve data processing performance,
scalability, and efficiency.

Why is it Necessary to Partition?

Partitioning is important for the following reasons −

 For easy management,

 To assist backup/recovery,

 To enhance performance.

For Easy Management


The fact table in a data warehouse can grow up to hundreds of gigabytes in size. A fact table of this size
is very hard to manage as a single entity; therefore, it needs partitioning.
To Assist Backup/Recovery
If we do not partition the fact table, then we have to load the complete fact table with all the
data. Partitioning allows us to load only as much data as is required on a regular basis. It reduces the time
to load and also enhances the performance of the system.

Note − To cut down on the backup size, all partitions other than the current partition can be marked as
read-only. We can then put these partitions into a state where they cannot be modified. Then they can be
backed up. It means only the current partition is to be backed up.

To Enhance Performance
By partitioning the fact table into sets of data, the query procedures can be enhanced. Query performance
is enhanced because now the query scans only those partitions that are relevant. It does not have to scan
the whole data.

The list of popular data partitioning techniques is as follows:

1. Horizontal Partitioning

2. Vertical Partitioning

3. Key-based Partitioning

4. Range-based Partitioning

5. Hash-based Partitioning

6. Round-robin Partitioning

Now let us discuss each partitioning in detail that is as follows:

1. Horizontal Partitioning/Sharding
In this technique, the dataset is divided based on rows or records. Each partition contains a
subset of rows, and the partitions are typically distributed across multiple servers or storage devices.
Horizontal partitioning is often used in distributed databases or systems to improve parallelism
and enable load balancing.

There are various ways in which a fact table can be partitioned. In horizontal partitioning, we have
to keep in mind the requirements for manageability of the data warehouse.

Partitioning by Time into Equal Segments


In this partitioning strategy, the fact table is partitioned on the basis of time period. Here each time period
represents a significant retention period within the business. For example, if the user queries for month to
date data then it is appropriate to partition the data into monthly segments. We can reuse the
partitioned tables by removing the data in them.
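
For illustration, here is a minimal Python sketch that groups fact rows into monthly segments; the row layout and the fact_sales naming scheme are assumptions made for the example.

from collections import defaultdict
from datetime import date

def partition_by_month(rows):
    """Group fact rows into monthly segments keyed by a partition-table name."""
    partitions = defaultdict(list)
    for row in rows:
        d = row["sales_date"]
        partitions[f"fact_sales_{d.year}_{d.month:02d}"].append(row)
    return partitions

rows = [{"product_id": 30, "sales_date": date(2013, 8, 3)},
        {"product_id": 35, "sales_date": date(2013, 9, 3)}]
print(sorted(partition_by_month(rows)))   # ['fact_sales_2013_08', 'fact_sales_2013_09']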

Partition by Time into Different-sized Segments

This kind of partitioning is done where aged data is accessed infrequently. It is implemented as a set of
small partitions for relatively current data and larger partitions for inactive data.
Points to Note
 The detailed information remains available online.

 The number of physical tables is kept relatively small, which reduces the operating cost.

 This technique is suitable where a mix of queries dipping into recent history and data mining
through the entire history is required.

 This technique is not useful where the partitioning profile changes on a regular basis,
because repartitioning will increase the operating cost of the data warehouse.

Partition on a Different Dimension

The fact table can also be partitioned on the basis of dimensions other than time such as product group,
region, supplier, or any other dimension. Let's have an example.

Suppose a marketing function has been structured into distinct regional departments, e.g., on a state-by-state
basis. If each region wants to query information captured within its region, it would prove to
be more effective to partition the fact table into regional partitions. This will speed up the queries
because they do not need to scan information that is not relevant.

Points to Note

 The query does not have to scan irrelevant data which speeds up the query process.

 This technique is appropriate only where the dimension is unlikely to change in the future. So, it is
worth determining that the dimension does not change in the future.

 If the dimension changes, then the entire fact table would have to be repartitioned.

Note − We recommend performing the partitioning only on the basis of the time dimension, unless you
are certain that the suggested dimension grouping will not change within the life of the data warehouse.

Partition by Size of Table

When there is no clear basis for partitioning the fact table on any dimension, we should partition
the fact table on the basis of its size. We can set a predetermined size as a critical point: when the
table exceeds the predetermined size, a new table partition is created. A minimal sketch of this rollover
idea follows the points below.

Points to Note

 This partitioning is complex to manage.


 It requires metadata to identify what data is stored in each partition.
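
The following Python sketch of the rollover idea assumes an in-memory row-count threshold stands in for the table-size check, with a metadata list recording where each row went.

class SizeBasedPartitioner:
    """Start a new partition whenever the current one reaches a row threshold."""

    def __init__(self, max_rows):
        self.max_rows = max_rows
        self.partitions = [[]]        # newest partition is last
        self.metadata = []            # (partition index, row key) for later lookups

    def insert(self, key, row):
        if len(self.partitions[-1]) >= self.max_rows:   # critical point reached
            self.partitions.append([])                  # create a new partition
        self.partitions[-1].append(row)
        self.metadata.append((len(self.partitions) - 1, key))

p = SizeBasedPartitioner(max_rows=2)
for i in range(5):
    p.insert(i, {"txn": i})
print(len(p.partitions))    # 3 partitions, of sizes 2, 2 and 1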

Partitioning Dimensions

If a dimension contains a large number of entries, then it may be necessary to partition the dimension.
Here we have to check the size of the dimension.

Consider a large design that changes over time. If we need to store all the variations in order to apply
comparisons, that dimension may be very large. This would definitely affect the response time.

Round Robin Partitions

In the round robin technique, when a new partition is needed, the old one is archived. It uses metadata to
allow the user access tool to refer to the correct table partition.

This technique makes it easy to automate table management facilities within the data warehouse.

Advantages:

1. Greater scalability: By distributing data among several servers or storage devices,
horizontal partitioning makes it possible to process large datasets in parallel.

2. Load balancing: By partitioning data, the workload can be distributed equally among
several nodes, avoiding bottlenecks and enhancing system performance.

3. Data separation: Since each partition can be managed independently, data isolation and
fault tolerance are improved. The other partitions can carry on operating even if one fails.

Disadvantages:

1. Join operations: Horizontal partitioning can make join operations across multiple partitions more
complex and potentially slower, as data needs to be fetched from different nodes.

2. Data skew: If the distribution of data is uneven or if some partitions receive more queries
or updates than others, it can result in data skew, impacting performance and load balancing.

3. Distributed transaction management: Ensuring transactional consistency across multiple partitions
can be challenging, requiring additional coordination mechanisms.

2. Vertical Partitioning

Unlike horizontal partitioning, vertical partitioning divides the dataset based on columns or attributes. In
this technique, each partition contains a subset of columns for each row. Vertical partitioning is useful
when different columns have varying access patterns or when some columns are more frequently
accessed than others.

Vertical partitioning splits the data by columns. The normalization tables below depict how vertical
partitioning is done.
Vertical partitioning can be performed in the following two ways −

 Normalization

 Row Splitting

1. Normalization

Normalization is the standard relational method of database organization. In this method, duplicate
rows are collapsed into a single row in a separate table, which reduces space. Take a look at the
following tables that show how normalization is performed.

Table before Normalization

Product_id | Qty | Value | sales_date | Store_id | Store_name | Location  | Region
30         | 5   | 3.67  | 3-Aug-13   | 16       | sunny      | Bangalore | S
35         | 4   | 5.33  | 3-Sep-13   | 16       | sunny      | Bangalore | S
40         | 5   | 2.50  | 3-Sep-13   | 64       | san        | Mumbai    | W
45         | 7   | 5.66  | 3-Sep-13   | 16       | sunny      | Bangalore | S

Tables after Normalization

Store_id | Store_name | Location  | Region
16       | sunny      | Bangalore | S
64       | san        | Mumbai    | W

Product_id | Quantity | Value | sales_date | Store_id
30         | 5        | 3.67  | 3-Aug-13   | 16
35         | 4        | 5.33  | 3-Sep-13   | 16
40         | 5        | 2.50  | 3-Sep-13   | 64
45         | 7        | 5.66  | 3-Sep-13   | 16
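
The same split can be expressed as a short Python sketch; the dictionaries below mirror the tables above, and the code merely stands in for what a normalized relational design achieves declaratively.

sales = [
    {"product_id": 30, "qty": 5, "value": 3.67, "sales_date": "3-Aug-13",
     "store_id": 16, "store_name": "sunny", "location": "Bangalore", "region": "S"},
    {"product_id": 40, "qty": 5, "value": 2.50, "sales_date": "3-Sep-13",
     "store_id": 64, "store_name": "san", "location": "Mumbai", "region": "W"},
]

stores = {}        # duplicate store details collapse into one row per store_id
facts = []         # narrow fact rows keep only the store_id foreign key
for row in sales:
    stores[row["store_id"]] = {"store_name": row["store_name"],
                               "location": row["location"],
                               "region": row["region"]}
    facts.append({k: row[k] for k in
                  ("product_id", "qty", "value", "sales_date", "store_id")})

print(stores[16])  # {'store_name': 'sunny', 'location': 'Bangalore', 'region': 'S'}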

DATA NORMALIZATION RULES


Data normalization rules are sequential—as you move through each rule, you normalize the data further.
For this reason, you can think of normalization rules as “levels” of normalization. Although there are five
rules in total, only three are commonly used for practical reasons, since too much normalization results in
inflexibility in the data model.

1NF (First Normal Form) rule


The first rule is about ensuring there are no repeating entries in a group. All attributes must have a unique
name, and entities may consist of only two dimensions. (You’ll need to create a new entity for additional
dimensions.) Each entry must have only a single value for each cell, and each record must be unique. The
goal of this first rule is to make it easier to search the data.

2NF (Second Normal Form) rule


The second rule is designed to eliminate redundant data. Before applying the 2NF rule, you must be sure
that the 1NF rule has been fully applied. Data that is in the 2NF state will have only one primary key. For
this reason, you must separate data by placing all subsets of data that can be placed in multiple rows into
separate tables. Relationships can then be created via foreign key labels.
3NF (Third Normal Form) rule
The third rule eliminates transitive dependency. As before, data must have achieved 1NF and 2NF status
before you can apply the 3NF rule. Transitive dependency is when a nonprime attribute depends on other
nonprime attributes instead of depending on the prime attributes or primary key. So the third rule ensures
that no attribute within an entity is dependent on a nonprime attribute that depends on the primary key.
For this reason, if the primary key is changed, all data impacted by the change has to be put into a new
table with a new foreign key.
BENEFITS OF DATA NORMALIZATION
Before we dive into the specifics of the rules involved in data normalization, let’s look at the benefits you
can expect from the normalization process.

 Eliminate duplicates for efficiency: Much of the work that normalization accomplishes is related
to eliminating duplicates, which frees space and allows your systems to run more efficiently.

 Ensure data is logically organized: Normalization applies a set of rules to associate attributes with
the data, allowing it to be organized by attribute. Because of this, you can organize data
more effectively.

 Increase accuracy: One of the goals of normalization is standardization, which
simultaneously ensures that the data is accurate.

 Query data from multiple sources easily: When data is standardized, you can combine data sets
and run queries on the larger set. Many analyses require looking at multiple data sets
from multiple sources.

 More easily update data: When data has been normalized, you can more easily update it because
you don’t have to be concerned about updating duplicates.

 Facilitate better data governance: Better governance is one of the most valuable benefits of data
normalization. The normalization process is an essential part of creating high-quality data that’s
been vetted and is ready for use. And because normalization also allows you to
effectively organize data, you can track it more easily.

Advantages of Normalization:
Here we can see why normalization is an attractive prospect in RDBMS concepts.
 A smaller database can be maintained, as normalization eliminates duplicate data. The overall
size of the database is reduced as a result.

 Better performance is ensured, which is linked to the point above. As databases become smaller
in size, passes through the data become faster and shorter, thereby improving response time
and speed.

 Narrower tables are possible, as normalized tables will be fine-tuned and will have
fewer columns, which allows more data records per page.

 Fewer indexes per table ensure faster maintenance tasks (index rebuilds).

 It also opens up the option of joining only the tables that are required.

 Reduces data redundancy and inconsistency: Normalization eliminates data redundancy and
ensures that each piece of data is stored in only one place, reducing the risk of data inconsistency
and making it easier to maintain data accuracy.

 Improves data integrity: By breaking down data into smaller, more specific tables, normalization
helps ensure that each table stores only relevant data, which improves the overall data integrity of
the database.
 Facilitates data updates: Normalization simplifies the process of updating data, as it only needs to
be changed in one place rather than in multiple places throughout the database.

 Simplifies database design: Normalization provides a systematic approach to database design that
can simplify the process and make it easier to develop and maintain the database over time.

 Supports flexible queries: Normalization enables users to query the database using a variety
of different criteria, as the data is organized into smaller, more specific tables that can be
joined together as needed.

 Helps ensure database scalability: Normalization helps ensure that the database can scale to meet
future needs by reducing data redundancy and ensuring that the data is organized in a way that
supports future growth and development.

 Supports data consistency across applications: Normalization can help ensure that data is
consistent across different applications that use the same database, making it easier to
integrate different applications and ensuring that all users have access to accurate and consistent
data.

Disadvantages of Normalization:
 More tables to join: by spreading information out into more tables, the need to join
tables increases and the task becomes more tedious. The database also becomes
harder to comprehend.

 Tables will contain codes rather than real data, as the repeated data will be
stored as codes instead of the actual values. Hence, there is always a need
to go to the lookup table.

 The data model becomes extremely hard to query against, as the data model
is optimized for applications, not for ad hoc querying. (An ad hoc query is a query
that cannot be determined before the issuance of the query. It consists of SQL that is
constructed dynamically and is usually built by desktop-friendly query tools.)
Consequently, it is hard to model the database without knowing what the customer
wants.

 As the normal form advances, the performance becomes increasingly slower.

 Proper knowledge of the various normal forms is needed to execute the
normalization process effectively. Careless use may lead to a bad design filled with major
anomalies and data inconsistency.

 Increased complexity: Normalization can increase the complexity of a database design, especially
if the data model is not well understood or if the normalization process is not carried out correctly.
This can lead to difficulty in maintaining and updating the database over time.
 Reduced flexibility: Normalization can limit the flexibility of a database, as it requires data to be
organized in a specific way. This can make it difficult to accommodate changes in the data or to
create new reports or applications that require different data structures.

 Increased storage requirements: Normalization can increase the storage requirements of a


database, as it may require more tables and additional join operations to access the data. This can
also increase the complexity and cost of the hardware required to support the database.

 Performance overhead: Normalization can result in increased performance overhead due to


the need for additional join operations and the potential for slower query execution times.

 Loss of data context: Normalization can result in the loss of data context, as data may be split
across multiple tables and require additional joins to retrieve. This can make it harder to
understand the relationships between different pieces of data.

 Potential for data update anomalies: Normalization can introduce the potential for data
update anomalies, such as insert, update, and delete anomalies, if the database is not properly
designed and maintained.

 Need for expert knowledge: Proper implementation of normalization requires expert knowledge
of database design and the normalization process. Without this knowledge, the database may not
be optimized for performance, and data consistency may be compromised.

2. Row Splitting
Row splitting tends to leave a one-to-one map between partitions. The motive of row splitting is to speed
up access to a large table by reducing its size; a minimal sketch follows the note below.

Note − While using vertical partitioning, make sure that there is no requirement to perform a major join
operation between two partitions.
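
The following Python sketch of row splitting assumes a wide account table divided into a frequently-read part and a rarely-read part that share the same key, so the two partitions map one-to-one.

def split_rows(wide_rows, hot_columns):
    """Split each wide row into two narrower rows joined one-to-one by 'id'."""
    hot, cold = [], []
    for row in wide_rows:
        hot.append({k: v for k, v in row.items() if k in hot_columns or k == "id"})
        cold.append({k: v for k, v in row.items() if k not in hot_columns})
    return hot, cold

wide = [{"id": 1, "balance": 120.0, "branch_name": "Central", "notes": "dormant"}]
hot, cold = split_rows(wide, hot_columns={"balance"})
print(hot)   # [{'id': 1, 'balance': 120.0}]
print(cold)  # [{'id': 1, 'branch_name': 'Central', 'notes': 'dormant'}]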

Identify Key to Partition

It is very crucial to choose the right partition key. Choosing a wrong partition key will lead to
reorganizing the fact table. Let's have an example. Suppose we want to partition the following table.

Account_Txn_Table

transaction_id

account_id

transaction_type

value

transaction_date

region

branch_name

We can choose to partition on any key. The two possible keys could be
 region

 transaction_date

Suppose the business is organized in 30 geographical regions and each region has a different number of
branches. That will give us 30 partitions, which is reasonable. This partitioning is good enough because
our requirements capture has shown that a vast majority of queries are restricted to the user's
own business region.

If we partition by transaction_date instead of region, then the latest transaction from every region will be
in one partition. Now the user who wants to look at data within his own region has to query
across multiple partitions.

Hence it is worth determining the right partitioning key.
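
A minimal Python sketch contrasting the two candidate keys; the sample transactions are illustrative.

from collections import defaultdict

def partition_on(rows, key):
    """Group rows into partitions by the value of the chosen partition key."""
    parts = defaultdict(list)
    for row in rows:
        parts[row[key]].append(row)
    return parts

txns = [{"transaction_id": 1, "region": "North", "transaction_date": "2013-09-01"},
        {"transaction_id": 2, "region": "South", "transaction_date": "2013-09-01"},
        {"transaction_id": 3, "region": "North", "transaction_date": "2013-09-02"}]

by_region = partition_on(txns, "region")
print(len(by_region["North"]))   # a regional query touches exactly one partition

by_date = partition_on(txns, "transaction_date")
# The same regional query must now scan every date partition.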


Advantages:

1. Improved query performance: By placing frequently accessed columns in a separate
partition, vertical partitioning can enhance query performance by reducing the amount of data
read from storage.

2. Efficient data retrieval: When a query only requires a subset of columns, vertical
partitioning allows retrieving only the necessary data, saving storage and I/O resources.

3. Simplified schema management: With vertical partitioning, adding or removing columns becomes
easier, as the changes only affect the respective partitions.

Disadvantages:

1. Increased complexity: Vertical partitioning can lead to more complex query execution plans, as
queries may need to access multiple partitions to gather all the required data.

2. Joins across partitions: Joining data from different partitions can be more complex and potentially
slower, as it involves retrieving data from different partitions and combining them.

3. Limited scalability: Vertical partitioning may not be as effective for datasets that
continuously grow in terms of the number of columns, as adding new columns may require
restructuring the partitions.

3. Key-based Partitioning

Using this method, the data is divided based on a particular key or attribute value. The dataset
is partitioned so that each partition contains all the data related to a specific key value. Key-based partitioning
is commonly used in distributed databases or systems to distribute the data evenly and allow efficient data
retrieval based on key lookups.
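
Here is a minimal Python sketch of key-based partitioning using an explicit key-to-partition directory; the directory contents are assumptions for the example (hashing the key is another common mapping, shown later under hash-based partitioning).

def key_partition(records, key, directory):
    """Route every record with the same key value to the same partition."""
    partitions = {p: [] for p in set(directory.values())}
    for record in records:
        partitions[directory[record[key]]].append(record)
    return partitions

directory = {"IN": 0, "US": 1, "UK": 1}     # key value -> partition id
records = [{"country": "IN", "amount": 10},
           {"country": "US", "amount": 7},
           {"country": "IN", "amount": 4}]
print(key_partition(records, "country", directory))
# partition 0 holds both "IN" records, so a key lookup touches one partition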

Advantages:

1. Even data distribution: Key-based partitioning ensures that data with the same key value is stored
in the same partition, enabling efficient data retrieval by key lookups.

2. Scalability: Key-based partitioning can distribute data evenly across partitions, allowing for better
parallelism and improved scalability.

3. Load balancing: By distributing data based on key values, the workload is balanced across
multiple partitions, preventing hotspots and optimizing performance.

Disadvantages:

1. Skew and hotspots: If the key distribution is uneven or if certain key values are more frequently
accessed than others, it can lead to data skew or hotspots, impacting performance and
load balancing.

2. Limited query flexibility: Key-based partitioning is most efficient for queries that primarily
involve key lookups. Queries that span multiple keys or require range queries may suffer
from increased complexity and potentially slower performance.

3. Partition management: Managing partitions based on key values requires careful planning
and maintenance, especially when the dataset grows or the key distribution changes.
4. Range Partitioning

Range partitioning divides the dataset according to a predetermined range of values. You can divide data
based on a particular time range, for instance, if your dataset contains timestamps. When you
want to distribute data evenly based on the range of values and have data with natural ordering, range
partitioning can be helpful.
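
A minimal Python sketch of range partitioning on timestamps follows; bisect locates the partition whose range covers each value, and the boundary dates are assumptions for the example.

import bisect
from datetime import date

def range_partition(d, boundaries):
    """Return the partition index for date d, given sorted lower bounds of
    partitions 1..n; dates before the first boundary go to partition 0."""
    return bisect.bisect_right(boundaries, d)

boundaries = [date(2013, 7, 1), date(2013, 8, 1), date(2013, 9, 1)]
print(range_partition(date(2013, 8, 3), boundaries))   # 2: the [1 Aug, 1 Sep) range
print(range_partition(date(2013, 6, 15), boundaries))  # 0: before all boundaries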

Advantages:

1. Natural ordering: Range partitioning is suitable for datasets with a natural ordering based
on a specific attribute. It allows for efficient data retrieval based on ranges of values.

2. Even data distribution: By dividing the dataset based on ranges, range partitioning can distribute
the data evenly across partitions, ensuring load balancing and optimal performance.

3. Simplified query planning: Range partitioning simplifies query planning when queries primarily
involve range-based conditions, as the system knows which partition(s) to access based on
the range specified.

Disadvantages:

1. Uneven data distribution: If the data distribution is not evenly distributed across ranges, it
can lead to data skew and impact load balancing and query performance.

2. Data growth challenges: As the dataset grows, the ranges may need to be adjusted or new
partitions added, requiring careful management and potentially affecting existing queries and data
distribution.

3. Joins and range queries: Range partitioning can introduce complexity when performing
joins across partitions or when queries involve multiple non-contiguous ranges, potentially
leading to performance challenges.

5. Hash-based Partitioning

Hash partitioning applies a hash function to the data to decide which partition it
belongs to. The data is fed into the hash function, which produces a hash value used to assign the data
to a particular partition. By randomly distributing data among partitions, hash-based partitioning can help
with load balancing and quick data retrieval.
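
A minimal Python sketch of hash-based partitioning; hashlib is used here because it gives a hash that is stable across runs, unlike Python's built-in hash() on strings, which is randomized per process.

import hashlib

def hash_partition(key, num_partitions):
    """Map a key to a partition via a stable hash, spreading keys roughly evenly."""
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions

for key in ("account-1", "account-2", "account-3"):
    print(key, "->", hash_partition(key, 4))   # the same key always lands in the same partition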

Advantages:

1. Even data distribution: Hash-based partitioning provides a random distribution of data
across partitions, ensuring even data distribution and load balancing.

2. Scalability: Hash-based partitioning enables scalable parallel processing by evenly
distributing data across multiple nodes.

3. Simplicity: Hash-based partitioning does not depend on any particular data properties or
ordering, and it is relatively easy to implement.
Disadvantages:

1. Key-based queries: Hash-based partitioning is not suitable for efficient key-based lookups, as the
data is distributed randomly across partitions. Key-based queries may require searching
across multiple partitions.

2. Load balancing challenges: In some cases, the distribution of data may not be perfectly balanced,
resulting in load imbalances and potential performance issues.

3. Partition management: Hash-based partitioning may require adjustments to the number of
partitions or hash functions as the dataset grows or the system requirements change, necessitating
careful management and potential data redistribution.

6. Round-robin Partitioning

In round-robin partitioning, data is evenly distributed across partitions in a cyclic manner. Each partition
is assigned the next available data item sequentially, regardless of the data’s characteristics. Round-robin
partitioning is straightforward to implement and can provide a basic level of load balancing.
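
A minimal Python sketch of round-robin assignment; itertools.cycle deals records out in strict rotation regardless of their content.

import itertools

def round_robin_partition(records, num_partitions):
    """Assign each record to the next partition in a cyclic manner."""
    partitions = [[] for _ in range(num_partitions)]
    for record, idx in zip(records, itertools.cycle(range(num_partitions))):
        partitions[idx].append(record)
    return partitions

print(round_robin_partition(list(range(7)), 3))
# [[0, 3, 6], [1, 4], [2, 5]] -- partition sizes differ by at most one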

Advantages:

1. Simple implementation: Round-robin partitioning is straightforward to implement, as it
assigns data items to partitions in a cyclic manner without relying on any specific data
characteristics.

2. Basic load balancing: Round-robin partitioning can provide a basic level of load balancing,
ensuring that data is distributed across partitions evenly.

3. Scalability: Round-robin partitioning divides the data into several parts and permits
parallel processing, which makes scaling possible.

Disadvantages:

1. Uneven partition sizes: If the number of data items is not a multiple of the number of
partitions, round-robin partitioning may produce slightly unequal partition sizes.

2. Inefficient data retrieval: Round-robin partitioning does not consider any data characteristics
or access patterns, which may result in inefficient data retrieval for certain queries.

3. Limited query optimization: Round-robin partitioning does not optimize for specific query
patterns or access patterns, potentially leading to suboptimal query performance.
Partitioning Technique | Description | Suitable Data | Query Performance | Data Distribution | Complexity
Horizontal Partitioning | Divides dataset based on rows/records | Large datasets | Complex joins | Uneven distribution | Distributed transaction management
Vertical Partitioning | Divides dataset based on columns/attributes | Wide tables | Improved retrieval | Efficient storage | Increased query complexity
Key-based Partitioning | Divides dataset based on a specific key | Key-value datasets | Efficient key lookups | Even distribution by key | Limited query flexibility
Range Partitioning | Divides dataset based on a specific range | Ordered datasets | Efficient range queries | Even distribution by range | Joins and range queries
Hash-based Partitioning | Divides dataset based on a hash function | Unordered datasets | Even distribution | Random distribution | Inefficient key-based queries
Round-robin Partitioning | Divides dataset in a cyclic manner | Equal-sized datasets | Basic load balancing | Even distribution | Limited query optimization
Difference between Vertical and Horizontal Partitioning

Feature | Vertical Partitioning | Horizontal Partitioning
Definition | Dividing a table into smaller tables based on columns. | Dividing a table into smaller tables based on rows (usually ranges of rows).
Purpose | Reduce the number of columns in a table to improve query performance and reduce I/O. | Divide a table into smaller tables to manage large volumes of data efficiently.
Data distribution | Columns with related data are placed together in the same table. | Rows with related data (typically based on a range or a condition) are placed together in the same table.
Query performance | Improves query performance when queries only involve specific columns that are part of a partition. | Improves query performance when queries primarily access a subset of rows in a large table.
Maintenance and indexing | Easier to manage and index specific columns based on their characteristics and access patterns. | Each partition can be indexed independently, making indexing more efficient.
Joins | May require joins to combine data from multiple partitions when querying. | Joins between partitions are typically not needed, as they contain disjoint sets of data.
Data integrity | Ensuring data consistency across partitions can be more challenging. | Easier to maintain data integrity, as each partition contains a self-contained subset of data.
Use cases | Commonly used for tables with a wide range of columns, where not all columns are frequently accessed together. | Commonly used for tables with a large number of rows, where data can be grouped based on some criteria (e.g., date ranges).
Examples | Splitting a customer table into one table for personal details and another for transaction history. | Partitioning a large sales order table by date, with each partition containing orders from a specific month or year.
Difference between Data Warehouse and Data Mart

Data Warehouse | Data Mart
A data warehouse is a vast repository of information collected from various organizations or departments within a corporation. | A data mart is a subtype of a data warehouse, architected to meet the requirements of a specific user group.
It may hold multiple subject areas. | It holds only one subject area, for example, Finance or Sales.
It holds very detailed information. | It may hold more summarized data.
It works to integrate all data sources. | It concentrates on integrating data from a given subject area or set of source systems.
In a data warehouse, the fact constellation schema is used. | In a data mart, the star schema and snowflake schema are used.
It is a centralized system. | It is a decentralized system.
A data warehouse is data-oriented. | A data mart is project-oriented.
