Notes Format
Subject-oriented: A data warehouse is organized around major subjects such as customer, supplier, product, and sales. Rather than concentrating on the day-to-day operations and transaction processing of an organization, a data warehouse focuses on the modeling and analysis of data for decision-makers. Hence, data warehouses typically provide a simple and concise view of particular subject issues by excluding data that are not useful in the decision support process.
Integrated: A data warehouse is usually constructed by integrating multiple heterogeneous sources, such
as relational databases, flat files, and online transaction records. Data cleaning and data
integration techniques are applied to ensure consistency in naming conventions, encoding
structures, attribute measures, and so on.
Time-variant: Data is stored to provide information from a historic perspective (e.g., the past 5–
10 years). Every key structure in the data warehouse contains, either implicitly or explicitly, a time
element.
Non-volatile: A data warehouse is always a physically separate store of data transformed from
the application data found in the operational environment. Due to this separation, a data warehouse does
not require transaction processing, recovery, and concurrency control mechanisms. It usually requires
only two operations in data accessing: initial loading of data and access of data.
On the other hand, data warehouse queries are often complex. They involve the computation of large groups of data at summarized levels and may require the use of special data organization, access, and
implementation methods based on multidimensional views. Processing OLAP queries in operational
databases would substantially degrade the performance of operational tasks.
Moreover, an operational database supports the concurrent processing of multiple
transactions. Concurrency control and recovery mechanisms (e.g., locking and logging) are required
to ensure the consistency and robustness of transactions. An OLAP query often needs read-only access
to data records for summarization and aggregation. Concurrency control and recovery mechanisms, if
applied for such OLAP operations, may jeopardize the execution of concurrent transactions and thus
substantially reduce the throughput of an OLTP system.
Finally, the separation of operational databases from data warehouses is based on the
different structures, contents, and uses of the data in these two systems.
• Virtual warehouse: A virtual warehouse is a set of views over operational databases. For
efficient query processing, only some of the possible summary views may be materialized.
A virtual warehouse is easy to build but requires excess capacity on operational database servers.
Given a set of dimensions, we can generate a cuboid for each of the possible subsets of the
given dimensions. The result would form a lattice of cuboids, each showing the data at a
different level of summarization or group-by.
The lattice of cuboids is then referred to as a data cube. Figure 4 shows a lattice of cuboids forming a
data cube for the dimensions time, item, location, and supplier.
The cuboid that holds the lowest level of summarization is called the base cuboid. For example,
the 4-D cuboid in Figure 3 is the base cuboid for the given time, item, location, and supplier dimensions.
Figure 2 is a 3-D (non-base) cuboid for time, item, and location, summarized for all suppliers. The 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. In our example, this is the total sales, or dollars sold, summarized over all four dimensions. The apex cuboid is typically denoted by all.
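To make the lattice concrete, here is a small Python sketch (not from the notes) that enumerates every cuboid for the four dimensions above: each subset of the dimensions is one group-by, the empty subset is the apex cuboid (all), and the full set is the base cuboid.

```python
from itertools import combinations

def cuboid_lattice(dimensions):
    """Enumerate every cuboid (group-by subset) for the given dimensions."""
    cuboids = []
    for k in range(len(dimensions) + 1):
        cuboids.extend(combinations(dimensions, k))
    return cuboids

dims = ["time", "item", "location", "supplier"]
lattice = cuboid_lattice(dims)

print(len(lattice))   # 16 cuboids: 2^4 subsets of 4 dimensions
print(lattice[0])     # () -> the apex cuboid, "all"
print(lattice[-1])    # ('time', 'item', 'location', 'supplier') -> the base cuboid
```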
Star schema: The most common modeling paradigm is the star schema, in which the data
warehouse contains
(1) a large central table (fact table) containing the bulk of the data, with no redundancy, and
(2) a set of smaller attendant tables (dimension tables), one for each dimension.
The schema graph resembles a starburst, with the dimension tables displayed in a radial pattern
around the central fact table.
A star schema for AllElectronics sales is shown in Figure 5. Sales are considered along
four dimensions: time, item, branch, and location. The schema contains a central fact table for sales that contains keys to each of the four dimensions, along with two measures: dollars sold and units sold.
To minimize the size of the fact table, dimension identifiers (e.g., time key and item key) are system-generated identifiers. Notice that in the star schema, each dimension is represented by only one table, and each table contains a set of attributes.
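As a minimal sketch of the star schema described above (assuming SQLite as the storage engine; the attribute names inside the dimension tables are illustrative, and only the four keys and the two measures come from the text):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One dimension table per dimension, each with a system-generated surrogate key.
# The central fact table holds one key per dimension plus the two measures.
conn.executescript("""
CREATE TABLE time_dim     (time_key     INTEGER PRIMARY KEY, day INTEGER, month INTEGER,
                           quarter TEXT, year INTEGER);
CREATE TABLE item_dim     (item_key     INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT);
CREATE TABLE branch_dim   (branch_key   INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
                           province_or_state TEXT, country TEXT);

CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim(time_key),
    item_key     INTEGER REFERENCES item_dim(item_key),
    branch_key   INTEGER REFERENCES branch_dim(branch_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    dollars_sold REAL,
    units_sold   INTEGER
);
""")
conn.commit()
```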
Snowflake schema: The snowflake schema is a variant of the star schema model, where some dimension
tables are normalized, thereby further splitting the data into additional tables. The resulting schema graph
forms a shape similar to a snowflake.
The major difference between the snowflake and star schema models is that the dimension tables
of the snowflake model may be kept in the normalized form to reduce redundancies. Such a table is easy
to maintain and saves storage space.
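A corresponding snowflake sketch (a separate in-memory database; the particular split is an illustrative assumption): the location dimension from the star schema is normalized so that city-level attributes move into their own table, referenced by a key.

```python
import sqlite3

snow = sqlite3.connect(":memory:")

# In the star schema, location_dim held street, city, province_or_state, and country
# in one denormalized table; the snowflake variant factors the city-level
# attributes out into a separate table to reduce redundancy.
snow.executescript("""
CREATE TABLE city_dim (
    city_key          INTEGER PRIMARY KEY,
    city              TEXT,
    province_or_state TEXT,
    country           TEXT
);
CREATE TABLE location_dim (
    location_key INTEGER PRIMARY KEY,
    street       TEXT,
    city_key     INTEGER REFERENCES city_dim(city_key)
);
""")
```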
Fact constellation: Sophisticated applications may require multiple fact tables to share dimension tables.
This kind of schema can be viewed as a collection of stars, and hence is called a galaxy schema or a fact constellation.
A fact constellation schema is shown in Figure 7. This schema specifies two fact tables, sales and shipping. The sales table definition is identical to that of the star schema (Figure 5). The shipping table
has five dimensions, or keys—item key, time key, shipper key, from location, and to location— and two
measures—dollars cost and units shipped.
A fact constellation schema allows dimension tables to be shared between fact tables. For
example, the dimension tables for time, item, and location are shared between the sales and shipping fact
tables.
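Extending the star-schema sketch from the previous section (this assumes the same SQLite connection `conn` and the dimension tables created there), the shipping fact table can reuse the time, item, and location dimensions; the shipper dimension's non-key columns are illustrative.

```python
# A second fact table sharing dimension tables with sales_fact (fact constellation).
conn.executescript("""
CREATE TABLE shipper_dim (shipper_key INTEGER PRIMARY KEY, shipper_name TEXT, shipper_type TEXT);

CREATE TABLE shipping_fact (
    item_key      INTEGER REFERENCES item_dim(item_key),
    time_key      INTEGER REFERENCES time_dim(time_key),
    shipper_key   INTEGER REFERENCES shipper_dim(shipper_key),
    from_location INTEGER REFERENCES location_dim(location_key),
    to_location   INTEGER REFERENCES location_dim(location_key),
    dollars_cost  REAL,
    units_shipped INTEGER
);
""")
conn.commit()
```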
Concept Hierarchies
A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-
level, more general concepts. Consider a concept hierarchy for the dimension location.
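A minimal sketch of such a hierarchy for location (street < city < province_or_state < country), with made-up values, showing how a low-level concept maps up to more general ones:

```python
# Concept hierarchy for location expressed as child -> parent mappings per level.
street_to_city = {"5 Main St": "Vancouver", "12 King St": "Toronto"}
city_to_province = {"Vancouver": "British Columbia", "Toronto": "Ontario"}
province_to_country = {"British Columbia": "Canada", "Ontario": "Canada"}

def roll_up(street):
    """Map a low-level location value to each higher, more general level."""
    city = street_to_city[street]
    province = city_to_province[city]
    country = province_to_country[province]
    return {"street": street, "city": city,
            "province_or_state": province, "country": country}

print(roll_up("5 Main St"))
# {'street': '5 Main St', 'city': 'Vancouver',
#  'province_or_state': 'British Columbia', 'country': 'Canada'}
```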
Top-Down Approach
In this approach the data in the data warehouse is stored at the lowest level of granularity
based on a normalized data model. The centralized data warehouse would feed the dependent data marts
that may be designed based on a dimensional data model.
• The advantages of this approach are:
• A truly corporate effort, an enterprise view of data
• Inherently architected, not a union of disparate data marts
• Single, central storage of data about the content
• Centralized rules and control
• May see quick results if implemented with iterations
• The disadvantages are:
• Takes longer to build even with an iterative method
• High exposure to risk of failure
• Needs high level of cross-functional skills
• High outlay without proof of concept
Bottom-Up Approach
In this approach data marts are created first to provide analytical and reporting capabilities
for specific business subjects based on the dimensional data model.
• Holistic: An aggregate function is holistic if there is no constant bound on the storage size needed
to describe a sub-aggregate. That is, there does not exist an algebraic function with M arguments
(where M is a constant) that characterizes the computation. Common examples of holistic
functions include median(), mode(), and rank(). A measure is holistic if it is obtained by applying
a holistic aggregate function.
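A small Python illustration (not from the notes) of why median() is holistic: partial sums from two partitions combine with constant storage, but no constant-size summary of each partition is enough to recover the overall median.

```python
from statistics import median

partition_a = [3, 7, 9]
partition_b = [1, 4, 8, 10]

# sum() needs only a constant-size sub-aggregate per partition: one number each.
total = sum([sum(partition_a), sum(partition_b)])            # 42

# median() is holistic: combining the partition medians gives the wrong answer,
# and no fixed-size summary helps; the true median needs all of the values.
wrong = median([median(partition_a), median(partition_b)])   # 6.5
true_median = median(partition_a + partition_b)              # 7

print(total, wrong, true_median)
```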
• The top-down view allows the selection of the relevant information necessary for the data
warehouse. This information matches current and future business needs.
• The data source view exposes the information being captured, stored, and managed
by operational systems. This information may be documented at various levels of detail
and accuracy, from individual data source tables to integrated data source tables. Data
sources are often modeled by traditional data modeling techniques, such as the entity-
relationship model or CASE (computer-aided software engineering) tools.
• The data warehouse view includes fact tables and dimension tables. It represents the information
that is stored inside the data warehouse, including pre-calculated totals and counts, as well
as information regarding the source, date, and time of origin, added to provide historical context.
• Finally, the business query view is the data perspective in the data warehouse from the end-user's viewpoint.
Building and using a data warehouse is a complex task because it requires business skills, technology
skills, and program management skills.
Regarding business skills, building a data warehouse involves understanding how systems store and
manage their data, how to build extractors that transfer data from the operational system to the
data warehouse, and how to build warehouse refresh software that keeps the data warehouse reasonably up-to-date with the operational system's data.
Regarding technology skills, data analysts are required to understand how to make assessments
from quantitative information and derive facts based on conclusions from historic information in
the data warehouse. These skills include the ability to discover patterns and trends, extrapolate
trends based on history and look for anomalies or paradigm shifts, and to present coherent managerial
recommendations based on such analysis.
Finally, program management skills involve the need to interface with many technologies, vendors,
and end-users to deliver results in a timely and cost effective manner.
The goals of an initial data warehouse implementation should be specific, achievable, and measurable.
Once a data warehouse is designed and constructed, the initial deployment of the
warehouse includes initial installation, roll-out planning, training, and orientation. Data warehouse
administration includes data refreshment, data source synchronization, planning for disaster recovery,
managing access control and security, managing data growth, managing database performance, and
data warehouse enhancement and extension.
Various kinds of data warehouse design tools are available. Data warehouse development tools
provide functions to define and edit metadata repository contents (e.g., schemas, scripts, or rules), answer
queries, output reports, and ship metadata to and from relational database system catalogs. Planning and
analysis tools study the impact of schema changes and refresh performance when changing refresh rates
or time windows.
Data Warehouse Usage for Information Processing
Data warehouses and data marts are used in a wide range of applications. There are three kinds of
data warehouse applications: information processing, analytical processing, and data mining.
Information processing supports querying, basic statistical analysis, and reporting using crosstabs,
tables, charts, or graphs. A current trend in data warehouse information processing is to construct low-
cost web-based accessing tools that are then integrated with web browsers.
Analytical processing supports basic OLAP operations, including slice-and-dice, drill-down, roll-up, and
pivoting. It generally operates on historic data in both summarized and detailed forms. The major strength
of online analytical processing over information processing is the multidimensional data analysis of data
warehouse data.
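A rough sketch of these operations on a tiny, made-up sales set, assuming pandas is available (the notes do not name a particular tool):

```python
import pandas as pd

# A small sales fact set with time and location dimensions (illustrative values).
sales = pd.DataFrame({
    "year":    [2023, 2023, 2023, 2024, 2024, 2024],
    "quarter": ["Q1", "Q2", "Q1", "Q1", "Q2", "Q2"],
    "city":    ["Vancouver", "Vancouver", "Toronto", "Toronto", "Vancouver", "Toronto"],
    "country": ["Canada"] * 6,
    "dollars_sold": [100.0, 150.0, 80.0, 120.0, 90.0, 200.0],
})

# Roll-up: climb the location hierarchy from city to country.
by_country = sales.groupby(["year", "country"])["dollars_sold"].sum()

# Drill-down: descend the time hierarchy from year to quarter.
by_quarter = sales.groupby(["year", "quarter", "city"])["dollars_sold"].sum()

# Slice: fix one dimension value (year = 2024) to get a sub-cube.
slice_2024 = sales[sales["year"] == 2024]

# Pivot: rotate the view so quarters become columns.
pivoted = sales.pivot_table(index="city", columns="quarter",
                            values="dollars_sold", aggfunc="sum")

print(by_country, by_quarter, slice_2024, pivoted, sep="\n\n")
```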
Data mining supports knowledge discovery by finding hidden patterns and associations,
constructing analytical models, performing classification and prediction, and presenting the mining
results using visualization tools. “How does data mining relate to information processing and
online analytical processing?” Information processing, based on queries, can find useful information.
However, answers to such queries reflect the information directly stored in databases or computable by
aggregate functions. They do not reflect sophisticated patterns or regularities buried in the database.
Therefore, information processing is not data mining.
Depending on the type and capacities of a warehouse, it can become home to structured, semi-structured,
or unstructured data.
Structured data is highly-organized and commonly exists in a tabular format like Excel files.
Unstructured data comes in all forms and shapes from audio files to PDF documents and doesn’t
have a pre-defined structure.
Semi-structured data is somewhere in the middle, meaning it is partially structured but doesn't
fit the tabular models of relational databases. Examples are JSON, XML, and Avro files.
1. The ability to identify the data in the data source environment that can be read by the
tool is necessary.
2. Support for flat files, indexed files, and legacy DBMSs is critical.
3. The capability to merge records from multiple data stores is required in many installations.
4. The specification interface to indicate the information to be extracted and the conversion criteria is essential.
5. The ability to read information from repository products or data dictionaries is desired.
6. The code developed by the tool should be completely maintainable.
7. Selective data extraction of both data items and records enables users to extract only the required
data.
8. A field-level data examination for the transformation of data into information is needed.
9. The ability to perform data type and character-set translation is a requirement when moving
data between incompatible systems.
10. The ability to create aggregation, summarization, and derivation fields and records is necessary.
11. Vendor stability and support for the products are components that must be evaluated carefully.
1. Business User: Business users require a data warehouse to view summarized data from the past. Since these people are non-technical, the data may be presented to them in an elementary form.
2. Store historical data: A data warehouse is required to store time-variant data from the past. This input is used for various purposes.
3. Make strategic decisions: Some strategies may depend upon the data in the data warehouse. So, the data warehouse contributes to making strategic decisions.
4. For data consistency and quality: By bringing data from different sources to a common place, the user can effectively bring uniformity and consistency to the data.
5. High response time: A data warehouse has to be ready for somewhat unexpected loads and types of queries, which demands a significant degree of flexibility and quick response time.
o They occasionally need to provide complete information before a request for information, which
also costs the organization money.
o Between data warehouses and operational systems, there is frequently a fine line. It is necessary to
determine which of these features can be used and which ones should be implemented in the data
warehouse since it would be expensive to carry out needless activities or to stop carrying
out those that would be required.
o It can be less useful for real-time decision-making due to the prolonged processing time it may require. In any event, the trend of modern products (along with technological advancements) addresses this issue by turning the drawback into a benefit.
o Regarding the various objectives a company seeks to achieve, challenges may arise during
implementation.
o It might be challenging to include new data sources once a system has been implemented.
o They necessitate an examination of the data model, objects, transactions, and storage.
o They must be designed in a sophisticated, multidisciplinary manner.
o The operating systems must be reorganized to accommodate them.
o Data warehouses are high-maintenance systems. Any restructuring of the source systems and business processes could influence the data warehouse, resulting in significant maintenance costs.
o The data warehouse may seem simple, but it is too complex for the typical person to comprehend.
o The scope of the data warehousing project will tend to expand, despite the best efforts of project management.
o At this point, various business regulations may already be in place for warehouse clients.
o Uniformization of data: imposing similar data formats across the many data sources is another topic data warehousing covers. The loss of some important data components could be the outcome.
1. Production Data This category of data comes from the various operational systems of the enterprise.
These normally include financial systems, manufacturing systems, systems along the supply chain, and
customer relationship management systems. Based on the information requirements in the
data warehouse, you choose segments of data from the different operational systems.
2. Internal Data In every organization, users keep their “private” spreadsheets, documents,
customer profiles, and sometimes even departmental databases. This is the internal data, parts of which
could be useful in a data warehouse.
3. Archived Data Operational systems are primarily intended to run the current business. In
every operational system, you periodically take the old data and store it in archived files. The
circumstances in your organization dictate how often and which portions of the operational
databases are archived for storage. Some data is archived after a year. Sometimes data is left in the
operational system databases for as long as five years. Much of the archived data comes from old legacy
systems that are nearing the end of their useful lives in organizations.
4. External Data Most executives depend on data from external sources for a high percentage of
the information they use. They use statistics relating to their industry produced by external
agencies and national statistical offices. They use market share data of competitors. They use
standard values of financial indicators for their business to check on their performance.
For example, the data warehouse of a car rental company contains data on the current production
schedules of the leading automobile manufacturers. This external data in the data warehouse helps the car
rental company plan for its fleet management.
Data Staging Component
After you have extracted data from various operational systems and from external sources, you have
to prepare the data for storing in the data warehouse. The extracted data coming from several disparate
sources needs to be changed, converted, and made ready in a format that is suitable to be stored
for querying and analysis. Three major functions need to be performed for getting the data ready. You
have to extract the data, transform the data, and then load the data into the data warehouse storage.
1. Data Extraction: This function has to deal with numerous data sources. You have to employ
the appropriate technique for each data source. Source data may be from different source machines in
diverse data formats. Part of the source data may be in relational database systems. Some data may be on
other legacy network and hierarchical data models.
Many data sources may still be in flat files. You may want to include data from spreadsheets and
local departmental data sets. Data extraction may become quite complex. Tools are available on
the market for data extraction. You may want to consider using outside tools suitable for certain data
sources.
2. Data Transformation: Data transformation involves many forms of combining pieces of data from
the different sources. You combine data from single source record or related data elements from many
source records. On the other hand, data transformation also involves purging source data that is not useful
and separating out source records into new combinations. Sorting and merging of data takes place on a
large scale in the data staging area.
3. Data Loading: Two distinct groups of tasks form the data loading function. When you complete the
design and construction of the data warehouse and go live for the first time, you do the initial loading of
the data into the data warehouse storage. The initial load moves large volumes of data using up
substantial amounts of time. As the data warehouse starts functioning, you continue to extract the changes
to the source data, transform the data revisions, and feed the incremental data revisions on an ongoing
basis.
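The three staging functions can be sketched in Python as follows; this is a minimal illustration rather than a production design, and the file name (sales_extract.csv), the column names, and the SQLite target are all assumptions.

```python
import csv
import sqlite3

def extract(path):
    """Extraction: read raw records from one flat-file source."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transformation: clean, convert types, and purge records that are not useful."""
    cleaned = []
    for row in rows:
        if not row.get("item_key"):           # drop unusable source records
            continue
        cleaned.append((
            int(row["time_key"]),
            int(row["item_key"]),
            float(row["dollars_sold"]),       # normalize data types
            int(row["units_sold"]),
        ))
    return cleaned

def load(rows, conn):
    """Loading: initial or incremental load into warehouse storage."""
    conn.executemany(
        "INSERT INTO sales_fact (time_key, item_key, dollars_sold, units_sold) "
        "VALUES (?, ?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect("warehouse.db")
conn.execute("""CREATE TABLE IF NOT EXISTS sales_fact
                (time_key INTEGER, item_key INTEGER,
                 dollars_sold REAL, units_sold INTEGER)""")
load(transform(extract("sales_extract.csv")), conn)
```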
Data Storage Component
The data storage for the data warehouse is a separate repository. The operational systems of
your enterprise support the day-to-day operations. These are online transaction processing
applications. The data repositories for the operational systems typically contain only the current
data. Also, these data repositories contain the data structured in highly normalized formats for fast and
efficient processing.
Generally, the database in your data warehouse must be open. Depending on your requirements, you
are likely to use tools from multiple vendors. The data warehouse must be open to different tools. Most of
the data warehouses employ relational database management systems.
Many data warehouses also employ multidimensional database management systems. Data extracted
from the data warehouse storage is aggregated in many ways and the summary data is kept in
the multidimensional databases (MDDBs). Such multidimensional database systems are usually
proprietary products.
In order to provide information to the wide community of data warehouse users, the information
delivery component includes different methods of information delivery. Figure 2-9 shows the different
information delivery methods. Ad hoc reports are predefined reports primarily meant for novice
and casual users. Provision for complex queries, multidimensional (MD) analysis, and statistical
analysis cater to the needs of the business analysts and power users. Information fed into executive
information systems (EIS) is meant for senior executives and high-level managers. Some data
warehouses also provide data to data-mining applications
Metadata Component
Metadata in a data warehouse is similar to the data dictionary or the data catalog in a
database management system. In the data dictionary, you keep the information about the logical data
structures, the information about the files and addresses, the information about the indexes, and so
on. The data dictionary contains data about the data in the database. Similarly, the metadata
component is the data about the data in the data warehouse.
Types of Metadata
Metadata in a data warehouse fall into three major categories:
o Operational metadata
o Extraction and transformation metadata
o End-user metadata
1. Operational Metadata: As you know, data for the data warehouse comes from several operational systems of the enterprise. These source systems contain different data structures. The data elements selected for the data warehouse have various field lengths and data types. In selecting data from the source systems for the data warehouse, you split records, combine parts of records from different source files, and deal with multiple coding schemes and field lengths. When you deliver information to the end-users, you must be able to tie that back to the original source data sets. Operational metadata contain all of this information about the operational data sources.
2. Extraction and Transformation Metadata: Extraction and transformation metadata contain data about the extraction of data from the source systems, namely, the extraction frequencies, extraction methods, and business rules for the data extraction. Also, this category of metadata contains information about all the data transformations that take place in the data staging area.
3. End-User Metadata: The end-user metadata is the navigational map of the data warehouse. It enables
the end-users to find information from the data warehouse. The end-user metadata allows the end-users to
use their own business terminology and look for information in those ways in which they normally think
of the business.
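One way to picture end-user metadata is as a navigational map from business terms to warehouse structures. The sketch below is purely illustrative; the terms, tables, and columns are assumptions, not taken from the notes.

```python
# End-user metadata as a navigational map: business terminology on the left,
# the warehouse tables/columns (and derivation) it resolves to on the right.
end_user_metadata = {
    "monthly revenue": {"fact_table": "sales_fact", "measure": "dollars_sold",
                        "aggregation": "SUM", "grain": "month"},
    "store":           {"dimension_table": "branch_dim",
                        "display_column": "branch_name"},
}

def lookup(business_term):
    """Let an end user locate warehouse data using their own business terminology."""
    return end_user_metadata.get(business_term.lower(), "term not catalogued")

print(lookup("Monthly Revenue"))
```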
1.3 Operational database Vs Data warehouse
3.Federated
Some companies get into data warehousing with an existing legacy of an assortment of decision-
support structures in the form of operational systems, extracted datasets, primitive data marts, and so on.
For such companies, it may not be prudent to discard all that huge investment and start from scratch. The
practical solution is a federated architectural type where data may be physically or logically integrated
through shared key fields, overall global metadata, distributed queries, and such other methods. In this
architectural type, there is no one overall data warehouse.
4.Hub-and-Spoke
This is the Inmon Corporate Information Factory approach. Similar to the centralized data
warehouse architecture, here too is an overall enterprise-wide data warehouse. Atomic data in the third
normal form is stored in the centralized data warehouse. The major and useful difference is the presence
of dependent data marts in this architectural type. Dependent data marts obtain data from the centralized
data warehouse. The centralized data warehouse forms the hub to feed data to the data marts on
the spokes. The dependent data marts may be developed for a variety of purposes: departmental
analytical needs, specialized queries, data mining, and so on. Each dependent data mart may have
normalized, denormalized, summarized, or dimensional data structures based on individual
requirements. Most queries are directed to the dependent data marts although the centralized data
warehouse may itself be used for querying. This architectural type results from adopting a top-down
approach to data warehouse development.
5.Data-Mart Bus
This is the Kimball conformed supermarts approach. You begin with analyzing requirements for a
specific business subject such as orders, shipments, billings, insurance claims, car rentals, and so on. You
build the first data mart (supermart) using business dimensions and metrics. These business dimensions
will be shared in the future data marts. The principal notion is that by conforming dimensions among the
various data marts, the result would be logically integrated supermarts that will provide an
enterprise view of the data. The data marts contain atomic data organized as a dimensional data
model. This architectural type results from adopting an enhanced bottom-up approach to data warehouse development.
Source layer: A data warehouse system uses heterogeneous sources of data. That data is stored initially in corporate relational databases or legacy databases, or it may come from information systems outside the corporate walls.
Data Staging: The data stored in the sources should be extracted, cleansed to remove inconsistencies and fill gaps, and integrated to merge heterogeneous sources into one standard schema. The so-called Extraction, Transformation, and Loading (ETL) tools can combine heterogeneous schemata and extract, transform, cleanse, validate, filter, and load source data into a data warehouse.
Data Warehouse layer: Information is saved to one logically centralized individual repository: a
data warehouse. The data warehouses can be directly accessed, but they can also be used as a
source for creating data marts, which partially replicate data warehouse contents and are designed
for specific enterprise departments. Meta-data repositories store information on sources, access
procedures, data staging, users, data mart schema, and so on.
Analysis: In this layer, integrated data is efficiently and flexibly accessed to issue reports, dynamically analyze information, and simulate hypothetical business scenarios. It should feature aggregate information navigators, complex query optimizers, and customer-friendly GUIs.
Three-Tier Architecture
The three-tier architecture consists of the source layer (containing multiple source systems), the
reconciled layer, and the data warehouse layer (containing both data warehouses and data marts).
The reconciled layer sits between the source data and data warehouse.
The main advantage of the reconciled layer is that it creates a standard reference data model for a
whole enterprise. At the same time, it separates the problems of source data extraction and
integration from those of the data warehouse population. In some cases, the reconciled layer is also
directly used to accomplish better some operational tasks, such as producing daily reports that cannot
be satisfactorily prepared using the corporate applications or generating data flows to feed external
processes periodically to benefit from cleaning and integration.
This architecture is especially useful for extensive, enterprise-wide systems. A disadvantage of this structure is the extra file storage space used by the redundant reconciled layer. It also puts the analytical tools a little further away from being real time.
4-Tier Architecture
User: At the end-user layer, data in the ODS, data warehouse, and data marts can be accessed by
using a variety of tools such as query and reporting tools, data visualization tools, and analytical
applications.
Presentation layer: Its functions include receiving input data, interpreting users' instructions, sending requests to the data services layer, and displaying the data obtained from the data services layer to users in a way they can understand. It is closest to users and provides an interactive operation interface.
Business logic: It is located between the presentation layer and the data access layer, playing a connecting role in the data exchange. The layer's concerns are focused primarily on the development of business rules, business processes, and business needs related to the system. It is also known as the domain layer.
Data Access: It is located in the innermost layer, implements persistence logic, and is responsible for access to the database. Operations on the data include finding, adding, deleting, modifying, and so on. This level works independently, without relying on other layers. The data access layer (DAL) extracts the appropriate data from the database and passes it to the upper layers.
2. Load Processing
Many steps must be taken to load new or updated data into the data warehouse, including data conversion, filtering, reformatting, indexing, and metadata update.
3. Data Quality Management
Fact-based management demands the highest data quality. The warehouse ensures local consistency,
global consistency, and referential integrity despite "dirty" sources and massive database size.
4. Query Performance
Fact-based management must not be slowed by the performance of the data warehouse RDBMS; large, complex queries must be completed in seconds, not days.
5. Terabyte Scalability
Data warehouse sizes are growing at astonishing rates. Today they range from a few gigabytes to hundreds of gigabytes, and terabyte-sized data warehouses are increasingly common.
5. Real-time Analytics: Combining real-time data streams with ADW for near real-time analytics.
6. Customer Analytics: Analyzing customer data to understand behavior and preferences.
7. Predictive Analytics: Building and training predictive models for forecasting and data-driven
insights.
8. Financial Analytics: Analyzing financial data for budgeting, forecasting, and performance
analysis.
9. IoT Data Analysis: Analyzing data from Internet of Things (IoT) devices to derive insights.
10. Compliance and Reporting: Storing historical data for compliance and reporting purposes.
What are the features of Oracle Autonomous Data Warehouse?
Self-Driving: Automated database tuning and optimization for better performance and reduced
manual tasks.
Self-Securing: Automated security measures to protect data and prevent unauthorized access.
Self-Repairing: Automatic error detection and resolution to ensure high availability.
Scalability: ADW can scale compute and storage resources independently to match workload
demands.
In-Memory Processing: Utilizes in-memory columnar processing for faster query performance.
Parallel Execution: Queries are processed in parallel across multiple nodes for faster results.
Integration with Oracle Ecosystem: Seamless integration with other Oracle Cloud services and
tools.
Data Encryption: Provides data encryption both at rest and in transit for data security.
Easy Data Loading: Supports data loading from various sources, including Oracle Data Pump,
SQL Developer, and SQL*Loader.
Pay-as-You-Go Pricing: Based on consumption, offering cost-effective pricing.
Oracle Autonomous Data Warehouse is built on Oracle Exadata, which is a highly optimized platform
for data warehousing and analytics.
1. Storage Layer: Data is stored in Exadata storage servers using a combination of flash and disk
storage.
2. Compute Layer: The compute nodes are responsible for processing queries and analyzing data.
ADW uses a massively parallel processing (MPP) architecture to parallelize queries across
multiple nodes for faster performance.
3. Autonomous Features: ADW leverages AI and machine learning to automate various
administrative tasks, including performance tuning, security patching, backups, and
fault detection.
Snowflake Data Cloud: Snowflake’s unique multi-cluster shared data architecture separates
compute and storage resources, enabling independent scaling of each component. This decoupled
architecture allows organizations to allocate resources based on their needs and budget, providing greater
flexibility and cost control. Additionally, Snowflake’s architecture supports near-infinite
concurrency, enabling multiple users and applications to access the same data simultaneously.
Main Differences
Technical Differences
1. Extract: The first stage in the ETL process is to extract data from various sources such
as transactional systems, spreadsheets, and flat files. This step involves reading data from the
source systems and storing it in a staging area.
2. Transform: In this stage, the extracted data is transformed into a format that is suitable
for loading into the data warehouse. This may involve cleaning and validating the data,
converting data types, combining data from multiple sources, and creating new data fields.
3. Load: After the data is transformed, it is loaded into the data warehouse. This step
involves creating the physical data structures and loading the data into the warehouse.
The ETL process is an iterative process that is repeated as new data is added to the warehouse. The
process is important because it ensures that the data in the data warehouse is accurate, complete, and up-
to-date. It also helps to ensure that the data is in the format required for data mining and reporting.
Additionally, there are many different ETL tools and technologies available, such as
Informatica, Talend, DataStage, and others, that can automate and simplify the ETL process.
ETL is a process in Data Warehousing and it stands for Extract, Transform and Load. It is a process
in which an ETL tool extracts the data from various data source systems, transforms it in the staging area,
and then finally, loads it into the Data Warehouse system.
ETL Tools: Most commonly used ETL tools are Hevo, Sybase, Oracle Warehouse
builder, CloverETL, and MarkLogic.
Data Warehouses: Most commonly used Data Warehouses are Snowflake, Redshift, BigQuery, and
Firebolt.
Overall, ETL process is an essential process in data warehousing that helps to ensure that the data
in the data warehouse is accurate, complete, and up-to-date. However, it also comes with its own set
of challenges and limitations, and organizations need to carefully consider the costs and benefits before implementing it.
ELT vs. ETL
o Loading: In ELT, loading is done only once, as all components are in one system. In ETL, extra time is required to load the data because a staging area is used.
o Cost: ELT is cost-effective and available to all businesses using SaaS solutions. ETL is not cost-effective for small and medium businesses.
o Users: The data transformed by ELT is used by data scientists and advanced analysts. The data transformed by ETL is used by users reading reports and by SQL coders.
o Views: ELT creates ad hoc views, with low cost for building and maintaining. In ETL, views are created based on multiple scripts, and deleting a view means deleting data.
Data extraction and transformation tools allow the automated extraction and cleaning of data from production systems. It is not advisable to enable direct access by query tools to these categories of systems, for the following reasons:
1. A huge load of complex warehousing queries would possibly have too much of a harmful impact
upon the mission-critical transaction processing (TP)-oriented application.
2. These TP systems have been optimized in their database design for transaction throughput. In each case, a database is designed for either optimal query processing or optimal transaction processing, not both. A complex business query requires the joining of many normalized tables, and as a result performance will usually be poor and the query constructs largely complex.
3. There is no assurance that data in two or more production methods will be consistent.
Designed for the workgroup environment, a LAN-based workgroup warehouse is optimal for any business organization that wants to build a data warehouse, often called a data mart. This type of data warehouse generally requires a minimal initial investment and technical training.
Data Delivery: With a LAN-based workgroup warehouse, a customer needs minimal technical knowledge to create and maintain a store of data that is customized for use at the department, business unit, or workgroup level. A LAN-based workgroup warehouse ensures the delivery of information from corporate resources by providing transport access to the data in the warehouse.
A LAN-based warehouse provides data from many sources, requiring a minimal initial investment and technical knowledge. A LAN-based warehouse can also use replication tools for populating and updating the data warehouse. This type of warehouse can include business views, histories, aggregation, versioning, and heterogeneous source support, such as
o DB2 Family
o IMS, VSAM, Flat File [MVS and VM]
A single store frequently drives a LAN-based warehouse and provides existing DSS applications, enabling the business user to locate data in their data warehouse. The LAN-based warehouse can support business users with a complete data-to-information solution. The LAN-based warehouse can also share metadata, with the ability to catalog business data and make it accessible to anyone who needs it.
This strategy dictates that end users are allowed to get at operational databases directly, using whatever tools are implemented in the data access network. This method provides ultimate flexibility as well as the minimum amount of redundant information that must be loaded and maintained. The data warehouse is a great idea, but it is difficult to build and requires investment. Why not use a cheap and fast method that eliminates the transformation step, the repositories for metadata, and the additional database? This method is termed the 'virtual data warehouse.'
Disadvantages
1. Since queries compete with production record transactions, performance can be degraded.
2. There is no metadata, no summary record, and no individual DSS (Decision Support System) integration or history. All queries must be repeated, causing an additional burden on the system.
3. There is no refreshing process, causing the queries to be very complex.
2.4 Data warehouse Design and Modeling
Data Warehouse Design
A data warehouse is a single data repository where a record from multiple data sources is
integrated for online business analytical processing (OLAP). This implies a data warehouse needs to meet
the requirements from all the business stages within the entire organization. Thus, data warehouse design
is a hugely complex, lengthy, and hence error-prone process. Furthermore, business analytical functions
change over time, which results in changes in the requirements for the systems. Therefore, data
warehouse and OLAP systems are dynamic, and the design process is continuous.
Data warehouse design takes a method different from view materialization in industry. It sees data warehouses as database systems with particular needs, such as answering
management related queries. The target of the design becomes how the record from multiple data sources
should be extracted, transformed, and loaded (ETL) to be organized in a database as the data warehouse.
There are two approaches
1. "top-down" approach
2. "bottom-up" approach
The advantage of the "bottom-up" design approach is that it has quick ROI, as developing a data mart, a
data warehouse for a single subject, takes far less time and effort than developing an enterprise-wide data
warehouse. Also, the risk of failure is even less. This method is inherently incremental. This method
allows the project team to learn and grow.
Top-down approach vs. bottom-up approach:
o Top-down: Breaks the vast problem into smaller sub-problems. Bottom-up: Solves the essential low-level problems and integrates them into a higher one.
o Top-down: Inherently architected, not a union of several data marts. Bottom-up: Inherently incremental; can schedule essential data marts first.
Data warehouse design is a business-wide journey. Data warehouses touch all areas of
your business, so every department needs to be on board with the design. Since your warehouse is
only as powerful as the data it contains, aligning departmental needs and goals with the overall project is
critical to your success.
So, if you currently can't combine all your sales data with all your marketing data, your overall
query results are missing some critical components. Knowing which leads are valuable can help you get
more value from your marketing data.
Every department needs to understand the purpose of the data warehouse, how it benefits them,
and what kinds of results they can expect from your warehousing solution.
This phase involves discovering your current and future needs by diving deep into your data (finding out what data is useful for analysis) and your current tech stack (where your data is currently siloed and not being used).
You can think of this as your overall data warehouse blueprint. But this phase is more about determining
your business needs, aligning those to your data warehouse, and, most importantly, getting everyone on
board with the data warehousing solution.
Data warehouses typically have three primary physical environments — development, testing, and
production. This mimics standard software development best practices, and your three environments exist
on completely separate physical servers.
Why do you need three separate environments?
You need a way to test changes before they move into the production environment.
Some security best practices require that testers and developers never have access to production
data.
Running tests against data typically uses extreme data sets or random sets of data from
the production environment — and you need a unique server to execute these tests en masse.
Having a development environment is a necessity, and dev environments exist in a unique state of
flux compared to production or test environments.
Production environments have much higher workloads (your whole business is using it), so trying
to run tests or develop in that environment can be stressful for both team members and servers.
Data integrity is much easier to track, and issues are easier to contain, when you have three environments running. It makes hunting down issues less stressful on your workloads, and data flow in production and testing environments can be stalled without impacting end users.
Running tests can often introduce breakpoints and hang your entire server. That's not something
you want happening in your production environment.
Imagine sharing resources between production, testing, and development. You don’t want
that!
Testing, development, and production environments all have different resource needs, and trying
to combine all functions into one server can be catastrophic for performance.
Remember, BI development is an ongoing process that really never grinds to a halt. This is especially true
in Agile/DevOps approaches to the software development lifecycle, which all require
separate environments due to the sheer magnitude of constant changes and adaptations.
You can choose to run more than these three environments, and some business users choose to
add additional environments for specific business needs. Integrate.io has seen staging environments that
are separate from testing solely for quality assurance work, as well as demo and integration
environments specifically for testing integrations.
You should have these three core environments, but you can layer in additional settings to fit your unique
business goals.
3. Data Warehouse Design: Introducing Data Modeling
Data modeling is the process of visualizing data distribution in your warehouse. Think of it as a blueprint.
Before you start building a house, it's important to know what goes where and why it goes there. That's
what data modeling is to data warehouses.
The above benefits of data modeling help improve decision-making throughout your organization.
However, data modeling is probably the most complex phase of data warehouse design, and there
are multiple data modeling techniques businesses can choose from for warehouse design. Before
jumping into a few of the most popular data modeling techniques, let's take a look at the differences between a data warehouse and a data mart.
A data warehouse is a system to store data in (or push data into) to run analytics and queries. A data mart,
on the other hand, is an area within a data warehouse that stores data for a specific business function.
So, say you've built your entire data warehouse. That's great! But does it account for how
different departments will use the data? Your sales team will use that data warehouse in a vastly different
way than your legal team. Plus, certain workflows and data sets are only valuable to certain teams. Data
marts are where all those team-specific data sets are stored, and related queries are processed.
Data modeling typically takes place at the data mart level and branches out into your data warehouse. It's
the logic behind how you store certain data in relation to other data.
1. Snowflake schema
2. Star schema
3. Galaxy schema
You should choose and develop a data model to guide your overall data architecture within your
warehouse. The model you choose will impact the structure of your data warehouse and data marts —
which impacts the ways that you utilize ETL tools like Integrate.io and run queries on that data.
ETL or Extract, Transform, Load is the process used to pull data out of your current tech stack or
existing storage solutions and put it into your warehouse. It goes something like this:
You extract data from a source system and place it into a staging area.
You transform that data into the best format for data analytics. You also remove any duplicated
data or inconsistencies that can make analysis difficult.
You then load the data to a data warehouse before pushing it through BI tools like Tableau and
Looker.
Normally, ETL is a complicated process that requires manual pipeline-building and lots of code. Building
these pipelines can take weeks or even months and might require a data engineering team. That’s where
ETL solutions come in. They automate many tasks associated with this data management and integration
process, freeing up resources for your team.
You should pay careful attention to the ETL solution you use so you can improve business
decisions. Since ETL is responsible for the bulk of the in-between work, choosing a subpar tool or
developing a poor ETL process can break your entire warehouse. You want optimal speeds, high
availability, good visualization, and the ability to build easy, replicable, and consistent data
pipelines between all your existing architecture and your new warehouse.
This is where ETL tools like Integrate.io are valuable. Integrate.io creates hyper-visualized data pipelines between all your valuable tech architecture while cleaning and normalizing that data for compliance and ease of use.
Remember, a good ETL process can mean the difference between a slow, painful-to-use data warehouse
and a simple, functional warehouse that's valuable throughout every layer of your organization.
ETL will likely be the go-to for pulling data from systems into your warehouse. Its counterpart, Extract, Load, Transform (ELT), negatively impacts the performance of most custom-built warehouses since data is
loaded directly into the warehouse before data organization and cleansing occur. However, there might be
other data integration use cases that suit the ELT process. Integrate.io not only executes ETL but
can handle ELT, Reverse ETL, and Change Data Capture (CDC), as well as provide data observability
and data warehouse insights.
5. Online Analytic Processing (OLAP) Cube
OLAP (Online Analytical Processing) cubes are commonly used in the data warehousing process
to enable faster, more efficient analysis of large amounts of data. OLAP cubes are based
on multidimensional databases that store summarized data and allow users to quickly analyze
information from different dimensions.
Here's how an OLAP cube fits into the data warehouse design:
OLAP cubes are designed to store pre-aggregated data that has been processed from
various sources in a data warehouse. The data is organized into a multi-dimensional structure that
enables users to view and analyze it from different perspectives.
OLAP cubes are created using a process called cube processing, which involves aggregating and
storing data in a way that enables fast retrieval and analysis. Cube processing can be performed on
a regular basis to ensure that the data is up-to-date and accurate.
OLAP cubes enable users to perform complex analytical queries on large volumes of data in real-
time, making it easier to identify trends, patterns, and anomalies. Users can also slice and
dice data in different ways to gain deeper insights into their business operations.
OLAP cubes support drill-down and roll-up operations, which allow users to navigate
through different levels of data granularity. Users can drill down to the lowest level of
detail to view individual transactions or roll up to higher levels of aggregation to view summary
data.
OLAP cubes can be accessed using a variety of tools, including spreadsheets, reporting tools, and
business intelligence platforms. Users can create reports and dashboards that display the data in a
way that is meaningful to them.
You'll likely need to address OLAP cubes if you're designing your entire database from scratch,
or if you're maintaining your own OLAP cube — which typically requires specialized personnel.
So, if you plan to use a vendor warehouse solution (e.g., Redshift or BigQuery), you probably won't need an OLAP cube (cubes are rarely used in either of those solutions).
If you have a set of BI tools requiring an OLAP cube for ad-hoc reporting, you may need to develop one
or use a vendor solution.
Here are the differences between a data warehouse and OLAP cubes:
A data warehouse is where you store your business data in an easily analyzable format to be used
for a variety of business needs.
Online Analytic Processing cubes help you analyze the data in your data warehouse or data mart.
Most of the time, OLAP cubes are used for reporting, but they have plenty of other use cases.
Since your data warehouse will have data coming in from multiple data pipelines, OLAP cubes help you
organize all that data in a multi-dimensional format that makes analyzing it rapid and
straightforward. OLAP cubes are a critical component of data warehouse design because they provide
fast and efficient access to large volumes of data, enabling users to make informed business decisions
based on insights derived from the data.
So far, this guide has only covered back-end processes. There needs to be front-end visualization,
so users can immediately understand and apply the results of data queries.
That's the job of your front end. There are plenty of tools on the market that help with visualization. BI
tools like Tableau (or PowerBI for those using BigQuery) are great for visualization. You can
also develop a custom solution — though that's a significant undertaking.
Most small-to-medium-sized businesses lean on established BI kits like those mentioned above.
But, some businesses may need to develop their own BI tools to meet ad-hoc analytic needs. For example,
a Sales Ops manager at a large company may need a specific BI tool for territory strategies.
This tool would probably be custom-developed given the scope of the company’s sales objectives.
7. Optimizing Queries
Optimizing queries is a critical part of data warehouse design. One of the primary goals
of building a data warehouse is to provide fast and efficient access to data for decision-making. During
the design process, data architects need to consider the types of queries that users will be running and
design the data warehouse schema and indexing accordingly.
Optimizing your queries is a complex process that's hyper-unique to your specific needs. But there
are some general rules of thumb.
Understand the limitations of your OLAP vendor. BigQuery uses a hybrid SQL language, and Redshift is built on top of a Postgres fork. Knowing the little nuances baked into your vendor can help you maximize workflows and speed up queries.
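As a hedged illustration of two common tactics, indexing the fact table's dimension keys and materializing a pre-aggregated summary table, here is a sketch in SQLite syntax that reuses the sales_fact table from the earlier staging sketch; the exact features and syntax vary by warehouse vendor.

```python
import sqlite3

conn = sqlite3.connect("warehouse.db")

conn.executescript("""
-- Index the fact table's dimension keys so typical star-join queries
-- do not scan the whole fact table.
CREATE INDEX IF NOT EXISTS idx_sales_time ON sales_fact (time_key);
CREATE INDEX IF NOT EXISTS idx_sales_item ON sales_fact (item_key);

-- Materialize a summary table for a frequent roll-up, so queries read
-- pre-computed totals instead of re-aggregating detail rows every time.
CREATE TABLE IF NOT EXISTS sales_by_time AS
    SELECT time_key,
           SUM(dollars_sold) AS dollars_sold,
           SUM(units_sold)   AS units_sold
    FROM sales_fact
    GROUP BY time_key;
""")
conn.commit()
```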
1. Identifying the target audience: This involves determining which groups or individuals within
the organization will benefit from using the data warehouse.
2. Determining the data requirements: This involves identifying the types of data that the target
audience needs access to and ensuring that this data is available within the data warehouse.
3. Developing user-friendly interfaces: This involves creating user interfaces that are intuitive and
easy to use, and that provide users with the ability to interact with the data in meaningful ways.
4. Testing and refining: This involves conducting user testing to ensure that the data
warehouse meets the needs of its users, and making adjustments as necessary.
5. Training users: This involves providing training and support to users to help them
understand how to use the data warehouse effectively.
6. Deploying the data warehouse: This involves introducing the data warehouse to its
intended users, and ensuring that the rollout process goes smoothly.
By establishing a rollout plan, organizations can ensure that their data warehouse is
introduced effectively and that users are able to make the most of the valuable data that it contains.
2.4.1 Modeling
Data warehouse modeling is the process of designing the schemas of the detailed and
summarized information of the data warehouse. The goal of data warehouse modeling is to develop
a schema describing the reality, or at least a part of it, which the data warehouse is needed to
support.
Data warehouse modeling is an essential stage of building a data warehouse for two main reasons.
Firstly, through the schema, data warehouse clients can visualize the relationships among the warehouse
data, to use them with greater ease. Secondly, a well-designed schema allows an effective data warehouse
structure to emerge, to help decrease the cost of implementing the warehouse and improve the efficiency
of using it.
Data modeling in data warehouses is different from data modeling in operational database
systems. The primary function of data warehouses is to support DSS processes. Thus, the objective of
data warehouse modeling is to make the data warehouse efficiently support complex queries on long term
information.
In contrast, data modeling in operational database systems targets efficiently supporting simple
transactions in the database such as retrieving, inserting, deleting, and changing data. Moreover,
data warehouses are designed for the customer with general information knowledge about the
enterprise, whereas operational database systems are more oriented toward use by software specialists for
creating distinct applications.
The data within the specific warehouse itself has a particular architecture with the emphasis on
various levels of summarization, as shown in figure:
The current detail record is central in importance as it:
o Reflects the most current happenings, which are commonly the most stimulating.
o It is voluminous, as it is saved at the lowest level of granularity.
o It is always (almost) saved on disk storage, which is fast to access but expensive and difficult to
manage.
Older detail data is stored in some form of mass storage; it is infrequently accessed and kept at a level of detail consistent with current detailed data.
Lightly summarized data is data extracted from the low level of detail found at the current, detailed level and is usually stored on disk storage. When building the data warehouse, we have to remember over what unit of time the summarization is done and also which components or attributes the summarized data will contain. Highly summarized data is compact and directly available and can even be found outside the warehouse. Metadata is the final element of the data warehouse; it is of a different dimension in that it is not the same as data drawn from the operational environment, but it is used as:
o A directory to help the DSS investigator locate the items of the data warehouse.
o A guide to the mapping of record as the data is changed from the operational data to the data
warehouse environment.
o A guide to the method used for summarization between the current, accurate data and the lightly
summarized information and the highly summarized data, etc.
Data Modeling Life Cycle
In this section, we define a data modeling life cycle. It is a straightforward process of
transforming the business requirements to fulfill the goals for storing, maintaining, and accessing
the data within IT systems. The result is a logical and physical data model for an enterprise data
warehouse.
The objective of the data modeling life cycle is primarily the creation of a storage area for
business information. That area comes from the logical and physical data modeling stages, as
shown in Figure:
Conceptual Data Model
A conceptual data model recognizes the highest-level relationships between the different entities.
Characteristics of the conceptual data model
o It contains the essential entities and the relationships among them.
o No attribute is specified.
o No primary key is specified.
We can see that the only information shown in the conceptual data model is the entities that define the
data and the relationships between those entities. No other detail is shown in the conceptual data
model.
The steps for designing the logical data model are as follows:
o Specify primary keys for all entities.
o List the relationships between different entities.
o List all attributes for each entity.
o Normalization.
o No data types are specified at this stage.
Data Modelling: This is the second step in the development of the Data Warehouse. Data
Modelling is the process of visualizing data distribution and designing databases by fulfilling the
requirements to transform the data into a format that can be stored in the data warehouse.
For example, whenever we start building a house, we put all the things in the correct
position as specified in the blueprint. That is what data modeling is for data warehouses. Data
modelling helps to organize data, creates connections between data sets, and helps establish data
compliance and security policies that line up with data warehousing goals. It is the
most complex phase of data warehouse development, and there are many data modelling
techniques that businesses use for warehouse design. Data modelling typically takes place at the
data mart level and branches out into the data warehouse. It is the logic of how the data is stored
in relation to other data. There are three data models for data warehouses:
Star Schema
Snowflake Schema
Galaxy Schema.
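As a rough, hedged illustration of the star schema named above, the sketch below models a hypothetical fact table that references two dimension tables through surrogate keys; all table names, columns, and values are invented for illustration and are not drawn from any particular warehouse.

# Minimal sketch of a star schema held in plain Python structures (illustrative only).
# Dimension tables: keyed by a surrogate key, holding descriptive attributes.
dim_item = {
    1: {"item_name": "Laptop", "item_type": "Electronics"},
    2: {"item_name": "Desk",   "item_type": "Furniture"},
}
dim_location = {
    10: {"city": "Chennai", "country": "India"},
    20: {"city": "Mumbai",  "country": "India"},
}

# Fact table: one row per sale, with foreign keys into the dimensions plus measures.
fact_sales = [
    {"item_key": 1, "location_key": 10, "units_sold": 3, "amount": 2100.0},
    {"item_key": 2, "location_key": 20, "units_sold": 1, "amount": 150.0},
]

# A simple "join" of fact rows to their dimensions, as a reporting tool would do.
for row in fact_sales:
    item = dim_item[row["item_key"]]
    loc = dim_location[row["location_key"]]
    print(item["item_type"], loc["city"], row["amount"])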
ETL Design and Development: This is the third step in the development of the Data Warehouse.
An ETL (Extract, Transform, Load) tool may extract data from various source systems and store it in a
data lake. An ETL process can then extract the data from the lake, transform it, and load it
into a data warehouse for reporting. For optimal speed, good visualization, and the ability
to build easy, replicable, and consistent data pipelines between all of the existing architecture
and the new data warehouse, we need ETL tools. This is where ETL tools like
SAS Data Management, IBM Information Server, Hive, etc. come into the picture. A good ETL
process can be helpful in constructing a simple yet functional data warehouse that is valuable
throughout every layer of the organization.
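As a minimal sketch of the extract-transform-load idea (not the workflow of SAS Data Management, IBM Information Server, or Hive), the example below extracts rows from a small in-memory CSV standing in for an operational export, applies a simple transformation, and loads the result into a SQLite table standing in for the warehouse; the file layout and table names are assumptions made for illustration.

import csv
import io
import sqlite3

# Extract: read rows from a source; a small in-memory CSV stands in for an
# operational export here.
source = io.StringIO("order_id,amount,region\n1,100.5,south\n2,80.0,NORTH\n")
rows = list(csv.DictReader(source))

# Transform: clean and standardize values (consistent casing, numeric types).
cleaned = [(int(r["order_id"]), float(r["amount"]), r["region"].strip().lower())
           for r in rows]

# Load: insert the transformed rows into a warehouse staging table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (order_id INTEGER, amount REAL, region TEXT)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", cleaned)
conn.commit()

# A simple summary query over the loaded data.
print(conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region").fetchall())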
OLAP Cubes: This is the fourth step in the development of the Data Warehouse. An OLAP cube,
also known as a multidimensional cube or hypercube, is a data structure that allows fast analysis
of data according to the multiple dimensions that define a business problem. A data warehouse
would extract information from multiple data sources and formats like text files, excel
sheets, multimedia files, etc. The extracted data is cleaned and transformed and is loaded into an
OLAP server (or OLAP cube) where information is pre-processed in advance for further
analysis. Usually, data operations and analysis are performed using a simple spreadsheet, where
data values are arranged in row and column format. This is ideal for two-dimensional data.
However, OLAP contains multidimensional data, with data typically obtained from different and
unrelated sources. Employing a spreadsheet isn’t an optimum choice. The cube will
store and analyze multidimensional data in a logical and orderly manner. Data warehouses
are now offered as fully built products that are configurable and capable of staging multiple
types of data. OLAP cubes are becoming outdated because they cannot deliver real-time
analysis and reporting, while businesses now expect higher performance.
UI Development: This is the fifth step in the development of the Data Warehouse. So far,
the processes discussed have taken place at the back end. There is a need for a user interface, i.e., the
means by which the user and the computer system interact, in particular through input devices and
software, so that users can directly access the data warehouse for analysis and report generation. The main
aim of a UI is to enable a user to effectively manage the device or machine they are interacting with.
There are plenty of tools in the market that help with UI development; BI tools like Tableau, or
Power BI for those using BigQuery, are good choices.
Maintenance: This is the sixth step in the development of the Data Warehouse. In this phase, we
can update or make changes to the schema and to the data warehouse's application domain or
requirements. Data warehouse maintenance systems must provide means to keep track of schema
modifications as well. At the schema level, we can insert or change dimensions and categories;
changes include, for example, adding or deleting user-defined attributes.
Test and Deployment: This is often the final step in the Data Warehouse development cycle.
Businesses and organizations test data warehouses to verify whether the required business
problems are implemented successfully or not. Warehouse testing involves the scrutiny
of enormous volumes of data. Data that has to be compared comes from heterogeneous data
sources like relational databases, flat files, operational data, etc. The overall data warehouse
project testing phases include data completeness, data transformation, data loading by means
of ETL tools, data integrity, etc. After testing, the data warehouse is deployed so that users can
immediately access the data and perform analysis. Basically, in this phase, the data warehouse is turned
on and users can take advantage of it. At the time of data warehouse deployment, most of its functions
are implemented. A data warehouse can be deployed in the organization's own data center or in the cloud.
IT Strategy: A data warehouse project must include an IT strategy for procuring and retaining funding.
Business Case Analysis: After the IT strategy has been designed, the next step is the business case. It is
essential to understand the level of investment that can be justified and to recognize the
projected business benefits which should be derived from using the data warehouse.
Education & Prototyping: The company experiments with the ideas of data analysis and educates
itself on the value of the data warehouse. This is valuable, and should be required, if this is
the company's first exposure to the benefits of decision support. Prototyping can advance this
education through working models. Prototyping requires business requirements,
a technical blueprint, and structures.
Technical blueprint: This phase arranges the architecture of the warehouse. The technical blueprint of
the delivery process produces an architecture plan which satisfies long-term requirements. It lays out the
server and data mart architecture and the essential components of the database design.
Building the vision: It is the phase where the first production deliverable is produced. This stage will
probably create significant infrastructure elements for extracting and loading information but limit them
to the extraction and load of information sources.
History Load: The next step is one where the remainder of the required history is loaded into the data
warehouse. This means that the new entities would not be added to the data warehouse, but additional
physical tables would probably be created to save the increased record volumes.
AD-Hoc Query: In this step, we configure an ad-hoc query tool to operate against the data warehouse.
These end-customer access tools are capable of automatically generating the database query that answers
any question posed by the user.
Automation: The automation phase is where many of the operational management processes, such as
extracting and loading the data, are fully automated within the data warehouse.
Extending Scope: In this phase, the scope of data warehouse is extended to address a new set
of business requirements. This involves the loading of additional data sources into the data warehouse
i.e. the introduction of new data marts.
Requirement Evolution: This is the last step of the delivery process of a data warehouse. As we
all know, requirements are not static and evolve continuously. As the business requirements
change, the changes must be reflected in the system.
OLAP implements the multidimensional analysis of business information and supports the
capability for complex calculations, trend analysis, and sophisticated data modeling. It is rapidly
becoming the essential foundation for intelligent solutions, including business performance
management, planning, budgeting, forecasting, financial reporting, analysis, simulation
models, knowledge discovery, and data warehouse reporting.
OLAP enables end users to perform ad hoc analysis of data in multiple dimensions, providing the
insight and understanding they require for better decision making.
1) Multidimensional Conceptual View: This is the central feature of an OLAP system. By requiring a
multidimensional view, it is possible to carry out operations like slice and dice.
2) Transparency: Make the technology, underlying information repository, computing operations,
and the dissimilar nature of source data totally transparent to users. Such transparency helps to improve
the efficiency and productivity of the users.
3) Accessibility: The OLAP system provides access only to the data that is actually required to perform
the particular analysis, presenting a single, coherent, and consistent view to the clients. The OLAP system
must map its own logical schema to the heterogeneous physical data stores and perform any necessary
transformations. The OLAP operations should sit between data sources (e.g., data warehouses) and
an OLAP front-end.
4) Consistent Reporting Performance: To make sure that users do not experience any significant
degradation in reporting performance as the number of dimensions or the size of the database
increases. That is, the performance of OLAP should not suffer as the number of dimensions is increased.
Users must observe consistent run time, response time, and machine utilization every time a given query
is run.
5) Client/Server Architecture: Make the server component of OLAP tools sufficiently intelligent that
the various clients can be attached with a minimum of effort and integration programming. The
server should be capable of mapping and consolidating data between dissimilar databases.
6) Generic Dimensionality: An OLAP method should treat each dimension as equivalent in both
its structure and operational capabilities. Additional operational capabilities may be granted to
selected dimensions, but such additional functions should be grantable to any dimension.
7) Dynamic Sparse Matrix Handling: To adapt the physical schema to the specific analytical
model being created and loaded in a way that optimizes sparse matrix handling. When encountering a
sparse matrix, the system must be able to dynamically infer the distribution of the information and adjust
the storage and access paths to obtain and maintain a consistent level of performance.
8) Multiuser Support: OLAP tools must provide concurrent data access, data integrity, and
access security.
9) Unrestricted Cross-dimensional Operations: The system should allow calculations and data
manipulation, such as roll-up and drill-down, to be performed within a dimension or across any number
of dimensions, without restriction.
10) Intuitive Data Manipulation: Data manipulation fundamental to the consolidation path, such
as reorientation (pivoting), drill-down, and roll-up, should be accomplished naturally and precisely
via point-and-click and drag-and-drop actions on the cells of the analytical model.
It avoids the use of a menu or multiple trips to a user interface.
11) Flexible Reporting: It gives business clients the ability to organize columns, rows, and
cells in a manner that facilitates simple manipulation, analysis, and synthesis of data.
12) Unlimited Dimensions and Aggregation Levels: The number of data dimensions should be
unlimited. Each of these generic dimensions must allow a practically unlimited number of
user-defined aggregation levels within any given consolidation path.
Major advantages to using OLAP
Business-focused calculations: One of the reasons OLAP systems are so fast is that they
pre-aggregate variables that would otherwise have to be generated on the fly in a traditional
relational database system. The calculation engine is in charge of both data aggregation
and business computations. The analytic abilities of an OLAP system are independent of
how the data is portrayed. The analytic calculations are kept in the system’s metadata rather than
in each report.
Business-focused multidimensional data: To organize and analyze data, OLAP uses
a multidimensional technique. Data is arranged into dimensions in a multidimensional method,
with each dimension reflecting various aspects of the business. A dimension can be
defined as a characteristic or an attribute of a data set. Elements of each dimension share the
same common trait. Within the dimension, the elements are typically structured hierarchically.
Trustworthy data and calculations: Data and calculations are centralized in OLAP
systems, guaranteeing that all end users have access to a single source of data. All data is
centralized in a multidimensional database in some OLAP systems. Several others centralize
some data in a multidimensional database and link to data stored relationally. Other OLAP
systems are integrated into a data warehouse and store data in multiple dimensions within the
database.
Flexible, self-service reporting: Business users can query data and create reports with
OLAP systems using tools that are familiar to them.
Speed-of-thought analysis: End-user queries are answered faster by OLAP systems than by
relational databases that do not use OLAP technology. OLAP systems pre-aggregate data,
allowing for fast response time.
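The speed-of-thought point can be made concrete with a small, assumed example: summary totals are computed once, the way an OLAP engine pre-aggregates its cube, so later queries become simple lookups instead of scans over the detail rows. The regions, quarters, and amounts below are invented.

from collections import defaultdict

# Detail-level facts: (region, quarter, sales_amount).
facts = [("north", "Q1", 100), ("north", "Q2", 120),
         ("south", "Q1", 80),  ("south", "Q2", 90)]

# Pre-aggregate once, the way an OLAP engine builds its summaries.
totals_by_region = defaultdict(int)
for region, quarter, amount in facts:
    totals_by_region[region] += amount

# Later queries are answered from the pre-aggregated store, not the detail rows.
print(totals_by_region["north"])   # 220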
OLAP queries are usually performed in a separate system, i.e., a data warehouse.
Transferring Data to Data Warehouse:
Data warehouses aggregate data from a variety of sources.
Data must be converted into a systematic format.
In a typical data warehouse project, data integration takes up 80% of the effort.
Optimization of Data Warehouse:
Data storage can be either relational or multi-dimensional.
Additional data structures include sorting, indexing, summarizing, and cubes.
Refreshing of data structures.
Querying Multidimensional data:
SQL extensions.
Map-reduce-based languages.
Multidimensional Expressions (MDX).
OLAP System
OLAP deals with historical or archival data, i.e., data accumulated over a long period. For example, if
we collect the last 10 years of information about flight reservations, the data can give us much
meaningful insight, such as trends in reservations. This may provide useful information like the peak
time of travel and what kind of people are traveling in various classes (Economy/Business).
The major difference between an OLTP and an OLAP system is the amount of data analyzed in a
single transaction. Whereas an OLTP system manages many concurrent users and queries touching only
an individual record or limited groups of records at a time, an OLAP system must have the capability
to operate on millions of records to answer a single query.
Feature: Characteristic
OLTP − It is a system which is used to manage operational data.
OLAP − It is a system which is used to manage informational data.
Consider the OLAP operations to be performed on multidimensional data. The figure
shows data cubes for the sales of a shop. The cube contains the dimensions location, time, and
item, where location is aggregated with respect to city values, time is aggregated with respect to
quarters, and item is aggregated with respect to item types.
1. Roll-Up
The roll-up operation (also known as the drill-up or aggregation operation) performs aggregation on
a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction.
Roll-up is like zooming out on the data cube. The figure shows the result of a roll-up operation
performed on the dimension location.
The concept hierarchy for location is defined as the order street < city < province or state < country. The
roll-up operation aggregates the data by ascending the location hierarchy from the level of city to the
level of country.
When a roll-up is performed by dimension reduction, one or more dimensions are removed from
the cube. For example, consider a sales data cube having two dimensions, location and time. Roll-up may
be performed by removing the time dimension, resulting in an aggregation of the total sales
by location, rather than by location and by time.
Example
Consider the following cubes illustrating temperature of certain days recorded weekly:
Temperature 64 65 68 69 70 71 72 75 80 81 83 85
Week1 1 0 1 0 1 0 0 0 0 0 1 0
Week2 0 0 0 1 0 0 1 2 0 1 0 0
Consider that we want to set up levels (hot (80-85), mild (70-75), cool (64-69)) in temperature from the
above cubes.
To do this, we have to group the columns and add up the values according to the concept hierarchy.
This operation is known as a roll-up.
By doing this, we obtain the following cube:
Temperature cool mild hot
Week1 2 1 1
Week2 1 3 1
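The roll-up above can be reproduced with a short sketch that maps each recorded temperature to its level (cool, mild, hot) and sums the counts per week; the input numbers are the ones from the example cube.

# Roll-up of the weekly temperature counts into the levels cool / mild / hot.
temperatures = [64, 65, 68, 69, 70, 71, 72, 75, 80, 81, 83, 85]
counts = {
    "Week1": [1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0],
    "Week2": [0, 0, 0, 1, 0, 0, 1, 2, 0, 1, 0, 0],
}

def level(t):
    # Concept hierarchy for the temperature dimension.
    if 64 <= t <= 69:
        return "cool"
    if 70 <= t <= 75:
        return "mild"
    return "hot"          # 80-85

# Climb the hierarchy: add the detail counts into their level buckets.
for week, row in counts.items():
    rolled = {"cool": 0, "mild": 0, "hot": 0}
    for t, c in zip(temperatures, row):
        rolled[level(t)] += c
    print(week, rolled)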
2. Drill-Down
The drill-down operation (also called roll-down) is the reverse of roll-up. Drill-down is
like zooming in on the data cube. It navigates from less detailed data to more detailed data. Drill-
down can be performed either by stepping down a concept hierarchy for a dimension or by adding
additional dimensions.
The figure shows a drill-down operation performed on the dimension time by stepping down a
concept hierarchy which is defined as day, month, quarter, and year. Drill-down occurs by descending
the time hierarchy from the level of quarter to the more detailed level of month.
Because a drill-down adds more details to the given data, it can also be performed by adding a
new dimension to a cube. For example, a drill-down on the central cubes of the figure can
occur by introducing an additional dimension, such as a customer group.
Example
Drill-down adds more details to the given data:
Temperature cool mild hot
Day 1 0 0 0
Day 2 0 0 0
Day 3 0 0 1
Day 4 0 1 0
Day 5 1 0 0
Day 6 0 0 0
Day 7 1 0 0
Day 8 0 0 0
Day 9 1 0 0
Day 10 0 1 0
Day 11 0 1 0
Day 12 0 1 0
Day 13 0 0 1
Day 14 0 0 0
3. Slice
The slice operation performs a selection on one dimension of the given cube, resulting in a subcube.
For example, if we make the selection temperature = cool, we obtain the following sub-cube:
Temperature cool
Day 1 0
Day 2 0
Day 3 0
Day 4 0
Day 5 1
Day 6 1
Day 7 1
Day 8 1
Day 9 1
Day 11 0
Day 12 0
Day 13 0
Day 14 0
The following diagram illustrates how slice works.
Here, slice is performed on the dimension time using the criterion time = "Q1".
It forms a new sub-cube by selecting one or more dimensions.
4. Dice
The dice operation defines a subcube by performing a selection on two or more dimensions.
For example, applying the selection (time = day 3 OR time = day 4) AND (temperature = cool OR
temperature = hot) to the original cube, we get the following subcube (still two-dimensional):
Temperature cool hot
Day 3 0 1
Day 4 0 0
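Both slice and dice can be viewed as filters over the cube's cells. The sketch below uses a handful of (day, temperature level, count) cells taken from the example above; it is illustrative only.

# Cube cells as (day, temperature_level, count) tuples for the example above.
cells = [
    ("Day 3", "cool", 0), ("Day 3", "hot", 1),
    ("Day 4", "cool", 0), ("Day 4", "hot", 0),
    ("Day 5", "cool", 1), ("Day 5", "hot", 0),
]

# Slice: fix a single dimension value (temperature = cool).
slice_cool = [c for c in cells if c[1] == "cool"]

# Dice: select on two or more dimensions at once.
dice = [c for c in cells
        if c[0] in ("Day 3", "Day 4") and c[1] in ("cool", "hot")]

print(slice_cool)
print(dice)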
HOLAP stands for Hybrid OLAP, an application using both relational and multidimensional techniques.
Relational OLAP (ROLAP) Server
These are intermediate servers which stand in between a relational back-end server and
user frontend tools.
They use a relational or extended-relational DBMS to save and handle warehouse data, and OLAP
middleware to provide missing pieces.
ROLAP servers contain optimization for each DBMS back end, implementation of
aggregation navigation logic, and additional tools and services.
ROLAP technology tends to have higher scalability than MOLAP technology.
ROLAP systems work primarily from the data that resides in a relational database, where the base
data and dimension tables are stored as relational tables. This model permits the multidimensional
analysis of data.
This technique relies on manipulating the data stored in the relational database to give the
appearance of traditional OLAP's slicing and dicing functionality. In essence, each act of slicing and
dicing is equivalent to adding a "WHERE" clause to the SQL statement.
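A rough sketch of that idea, not the behavior of any particular ROLAP product: each slice or dice condition is translated into a predicate appended to the generated SQL. The table and column names are invented.

# Translating OLAP selections into SQL predicates, the way a ROLAP layer might.
def rolap_query(table, measures, filters):
    select = ", ".join(measures)
    where = " AND ".join(f"{col} = ?" for col in filters)
    sql = f"SELECT {select} FROM {table}"
    if where:
        sql += f" WHERE {where}"
    return sql, list(filters.values())

# A dice on time and region becomes an ordinary parameterized WHERE clause.
sql, params = rolap_query("sales_fact", ["SUM(amount)"],
                          {"quarter": "Q1", "region": "north"})
print(sql, params)
# SELECT SUM(amount) FROM sales_fact WHERE quarter = ? AND region = ? ['Q1', 'north']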
ROLAP architecture includes the following components:
o Database server.
o ROLAP server.
o Front-end tool.
Relational OLAP (ROLAP) is the latest and fastest-growing OLAP technology segment in the
market. This method allows multiple multidimensional views of two-dimensional relational tables to be
created, avoiding the need to structure the data around the desired view.
Some products in this segment have supported strong SQL engines to handle the complexity
of multidimensional analysis. This includes creating multiple SQL statements to handle user requests,
being RDBMS-aware, and being capable of generating SQL statements tuned to the optimizer of the
DBMS engine.
Advantages
Can handle large amounts of information: The data size limitation of ROLAP technology depends
on the data size of the underlying RDBMS; ROLAP itself does not restrict the amount of data.
Can leverage RDBMS functionality: The RDBMS already comes with many features, so ROLAP
technologies, which work on top of the RDBMS, can take advantage of these functionalities.
Disadvantages
Performance can be slow: Because each ROLAP report is essentially a SQL query (or multiple SQL
queries) against the relational database, query time can be long if the underlying data size is large.
Limited by SQL functionality: ROLAP technology relies upon generating SQL statements
to query the relational database, and SQL statements do not suit all needs.
A MOLAP system is based on a native logical model that directly supports multidimensional data
and operations. Data are stored physically into multidimensional arrays, and positional techniques
are used to access them.
One of the significant distinctions of MOLAP against a ROLAP is that data are summarized and
are stored in an optimized format in a multidimensional cube, instead of in a relational database.
In the MOLAP model, data is structured into proprietary formats according to the client's reporting
requirements, with the calculations pre-generated on the cubes.
MOLAP Architecture
o Database server.
o MOLAP server.
o Front-end tool.
MOLAP structure primarily reads the precompiled data. MOLAP structure has limited
capabilities to dynamically create aggregations or to evaluate results which have not been pre-calculated
and stored.
Applications requiring iterative and comprehensive time-series analysis of trends are well suited
for MOLAP technology (e.g., financial analysis and budgeting).
Examples include Arbor Software's Essbase, Oracle's Express Server, Pilot Software's Lightship
Server, Sniper's TM/1, Planning Science's Gentium, and Kenan Technology's Multiway.
Some of the problems faced by clients are related to maintaining support for multiple subject areas
in an RDBMS. Some vendors can solve these problems by providing access from MOLAP tools
to detailed data in an RDBMS.
This can be very useful for organizations with performance-sensitive multidimensional analysis
requirements and that have built or are in the process of building a data warehouse architecture
that contains multiple subject areas.
An example would be the creation of sales data measured by several dimensions (e.g., product and sales
region) to be stored and maintained in a persistent structure. This structure would be provided to reduce
the application overhead of performing calculations and building aggregation during initialization. These
structures can be automatically refreshed at predetermined intervals established by an administrator.
Advantages
Excellent Performance: A MOLAP cube is built for fast information retrieval, and is optimal for
slicing and dicing operations.
Can perform complex calculations: All calculations have been pre-generated when the cube
is created. Hence, complex calculations are not only possible, but they return quickly.
Disadvantages
Limited in the amount of information it can handle: Because all calculations are
performed when the cube is built, it is not possible to contain a large amount of data in the cube
itself.
Requires additional investment: Cube technology is generally proprietary and does not already
exist in the organization. Therefore, to adopt MOLAP technology, chances are other investments
in human and capital resources are needed.
HOLAP incorporates the best features of MOLAP and ROLAP into a single
architecture. HOLAP systems save more substantial quantities of detailed data in the relational
tables, while the aggregations are stored in the pre-calculated cubes. HOLAP can also drill through from
the cube down to the relational tables for detailed data. Microsoft SQL Server 2000
provides a hybrid OLAP server.
Advantages of HOLAP
3. HOLAP balances the disk space requirement, as it only stores the aggregate information on the
OLAP server and the detail record remains in the relational database. So no duplicate copy of the
detail record is maintained.
Disadvantages of HOLAP
1. HOLAP architecture is very complicated because it supports both MOLAP and ROLAP servers.
Other Types
There are also less popular types of OLAP upon which one might stumble every so
often. Some of the less common variants in the OLAP industry are listed below.
WOLAP (Web OLAP) refers to an OLAP application that is accessible via a web browser. Unlike
traditional client/server OLAP applications, WOLAP is considered to have a three-tiered architecture
which consists of three components: a client, a middleware, and a database server.
DOLAP (Desktop OLAP) permits a user to download a section of the data from the database or source,
and work with that dataset locally, on their desktop.
Mobile OLAP enables users to access and work on OLAP data and applications remotely through
the use of their mobile devices.
SOLAP (Spatial OLAP) includes the capabilities of both Geographic Information Systems (GIS) and
OLAP in a single user interface. It facilitates the management of both spatial and non-spatial data.
2.11 ROLAP Vs MOLAP Vs HOLAP
ROLAP stands for Relational Online Analytical Processing. MOLAP stands for Multidimensional
Online Analytical Processing. HOLAP stands for Hybrid Online Analytical Processing.

ROLAP − The ROLAP storage mode causes the aggregations of the partition to be stored in indexed
views in the relational database that was specified in the partition's data source.
MOLAP − The MOLAP storage mode causes the aggregations of the partition and a copy of its source
information to be stored in a multidimensional structure in Analysis Services when the partition is
processed.
HOLAP − The HOLAP storage mode combines attributes of both MOLAP and ROLAP. Like MOLAP,
HOLAP causes the aggregations of the partition to be stored in a multidimensional structure in an SQL
Server Analysis Services instance.

ROLAP − ROLAP does not cause a copy of the source information to be stored in the Analysis Services
data folders. Instead, when the result cannot be derived from the query cache, the indexed views in the
data source are accessed to answer queries.
MOLAP − The MOLAP structure is highly optimized to maximize query performance. The storage area
can be on the computer where the partition is defined or on another computer running Analysis Services.
Because a copy of the source information resides in the multidimensional structure, queries can be
resolved without accessing the partition's source data.
HOLAP − HOLAP does not cause a copy of the source information to be stored. For queries that access
only the summary data in the aggregations of a partition, HOLAP is the equivalent of MOLAP.

ROLAP − Query response is frequently slower with ROLAP storage than with the MOLAP or HOLAP
storage modes. Processing time is also frequently slower with ROLAP.
MOLAP − Query response times can be reduced substantially by using aggregations. The data in the
partition's MOLAP structure is only as current as the most recent processing of the partition.
HOLAP − Queries that access source data (for example, drilling down to an atomic cube cell for which
there is no aggregation data) must retrieve it from the relational database and will not be as fast as they
would be if the source information were stored in the MOLAP structure.
Difference between ROLAP and MOLAP
ROLAP − ROLAP stands for Relational Online Analytical Processing.
MOLAP − MOLAP stands for Multidimensional Online Analytical Processing.
ROLAP − It is usually used when the data warehouse contains relational data.
MOLAP − It is used when the data warehouse contains relational as well as non-relational data.
ROLAP − It has a high response time.
MOLAP − It has a lower response time due to prefabricated cubes.
Metadata is simply defined as data about data. The data that is used to represent other data is known as
metadata. For example, the index of a book serves as a metadata for the contents in the book. In other
words, we can say that metadata is the summarized data that leads us to detailed data. In terms of data
warehouse, we can define metadata as follows.
Metadata acts as a directory. This directory helps the decision support system to locate
the contents of a data warehouse.
Note − In a data warehouse, we create metadata for the data names and definitions of a given
data warehouse. Along with this metadata, additional metadata is also created for time-stamping any
extracted data and for recording the source of the extracted data.
Metadata can be stored in various forms, such as text, XML, or RDF, and can be organized
using metadata standards and schemas. There are many metadata standards that have been
developed to facilitate the creation and management of metadata, such as Dublin Core, schema.org, and
the Metadata Encoding and Transmission Standard (METS). Metadata schemas define the structure
and format of metadata and provide a consistent framework for organizing and describing data.
1. A library catalog may be considered metadata. The catalog metadata consists of several
predefined components representing specific attributes of a resource, and each component can have one
or more values. These components could be the name of the author, the name of the document,
the publisher's name, the publication date, and the categories to which it belongs.
2. The table of contents and the index in a book may be treated as metadata for the book.
3. Suppose we say that a data item about a person is 80. This must be defined by noting that it is the
person's weight and the unit is kilograms. Therefore, (weight, kilograms) is the metadata about the
data value 80.
4. Another example of metadata is data about the tables and figures in a report. A
table has a name (e.g., its title), and the column names of the table
may be treated as metadata. The figures also have titles or names.
Technical Metadata − It includes database system names, table and column names and sizes, data
types and allowed values. Technical metadata also includes structural information such as primary
and foreign key attributes and indices.
Operational Metadata − It includes currency of data and data lineage. Currency of data
means whether the data is active, archived, or purged. Lineage of data means the history of data
migrated and transformation applied on it.
Types of Metadata
o Operational Metadata
o End-User Metadata
Operational Metadata
As we know, data for the data warehouse comes from various operational systems of the
enterprise. These source systems contain different data structures, and the data elements selected for the
data warehouse have various field lengths and data types.
In selecting information from the source systems for the data warehouse, we split records, combine
parts of records from different source files, and deal with multiple coding schemes and field lengths.
When we deliver information to the end users, we must be able to tie it back to the source data sets.
Operational metadata contains all of this information about the operational data sources.
Metadata has a very important role in a data warehouse. The role of metadata in a
warehouse is different from the warehouse data, yet it plays an important role. The various roles
of metadata are explained below.
Metadata acts as a directory.
This directory helps the decision support system to locate the contents of the data warehouse.
Metadata helps the decision support system in mapping data when it is transformed
from the operational environment to the data warehouse environment.
Metadata helps in summarization between current detailed data and highly summarized data.
Metadata also helps in summarization between lightly detailed data and highly summarized data.
Metadata repository is an integral part of a data warehouse system. It has the following metadata −
Definition of data warehouse − It includes the description of structure of data warehouse. The
description is defined by schema, view, hierarchies, derived data definitions, and data mart
locations and contents.
Business metadata − It contains the data ownership information, business definitions,
and changing policies.
Operational Metadata − It includes currency of data and data lineage. Currency of data means
whether the data is active, archived, or purged. Lineage of data means the history of data migrated
and transformation applied on it.
Data for mapping from operational environment to data warehouse − It includes the source
databases and their contents, data extraction, data partitioning, cleaning, transformation rules,
and data refresh and purging rules.
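The kinds of entries listed above can be made concrete with a toy sketch; the structure and field names below are invented, minimal stand-ins rather than any standard metadata format.

# A toy metadata repository: each entry describes warehouse data rather than
# holding the data itself.
metadata_repository = {
    "sales_fact": {
        "technical": {"columns": {"amount": "REAL", "quarter": "TEXT"},
                      "primary_key": ["sale_id"]},
        "business": {"owner": "Sales department",
                     "definition": "One row per completed sale"},
        "operational": {"currency": "active",
                        "lineage": "extracted nightly from the order system"},
        "mapping": {"source": "orders.order_lines",
                    "transformation": "currency converted to USD"},
    }
}

# A decision-support tool can use the repository as a directory.
print(metadata_repository["sales_fact"]["business"]["definition"])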
A metadata repository is a centralized database or system that is used to store and manage
metadata. Some of the benefits of using a metadata repository include:
1. Improved data quality: A metadata repository can help ensure that metadata is
consistently structured and accurate, which can improve the overall quality of the data.
2. Increased data accessibility: A metadata repository can make it easier for users to access
and understand the data, by providing context and information about the data.
3. Enhanced data integration: A metadata repository can facilitate data integration by providing a
common place to store and manage metadata from multiple sources.
4. Improved data governance: A metadata repository can help enforce metadata standards
and policies, making it easier to ensure that data is being used and managed appropriately.
5. Enhanced data security: A metadata repository can help protect the privacy and security
of metadata, by providing controls to restrict access to sensitive or confidential information.
Metadata repositories can provide many benefits in terms of improving the quality, accessibility,
and management of data.
A metadata repository also gives a useful data administration tool for managing corporate information
assets along with the data dictionary.
There are several challenges that can arise when managing metadata:
2. Data quality: Poorly structured or incorrect metadata can lead to problems with data
quality, making it more difficult to use and understand the data.
3. Data integration: When integrating data from multiple sources, it can be challenging to ensure
that the metadata is consistent and aligned across the different sources.
4. Data governance: Establishing and enforcing metadata standards and policies can be
difficult, especially in large organizations with multiple stakeholders.
5. Data security: Ensuring the security and privacy of metadata can be a challenge, especially when
working with sensitive or confidential information.
A Data Mart is a subset of an organizational data store, generally oriented to a specific purpose
or primary data subject, which may be distributed to support business needs. Data marts are
analytical data stores designed to focus on particular business functions for a specific
community within an organization. Data marts are usually derived from subsets of data in a data
warehouse, though in the bottom-up data warehouse design methodology, the data warehouse is created
from the union of organizational data marts.
The fundamental use of a data mart is for Business Intelligence (BI) applications. BI is used to gather,
store, access, and analyze data. A data mart can be used by smaller businesses to utilize the data they
have accumulated, since it is less expensive than implementing a data warehouse.
Reasons for creating a data mart
o Ease of creation
o Potential clients are more clearly defined than in a comprehensive data warehouse
There are mainly two approaches to designing data marts. These approaches are
A dependent data mart is a logical subset or a physical subset of a larger data warehouse. According to
this technique, the data marts are treated as subsets of a data warehouse. In this technique, a
data warehouse is created first, from which various data marts can then be created. These data marts
are dependent on the data warehouse and extract the essential data from it. Since the data
warehouse creates the data marts, there is no need for data mart integration. This is also known as a
top-down approach.
Independent Data Marts
The second approach is Independent Data Marts (IDM). Here, independent data marts are created first,
and then a data warehouse is designed using these multiple independent data marts. In this approach, as
all the data marts are designed independently, the integration of the data marts is required. It is also
termed a bottom-up approach, as the data marts are integrated to develop a data warehouse.
Other than these two categories, one more type exists, called "Hybrid Data Marts."
A hybrid data mart allows us to combine input from sources other than a data warehouse. This can be
helpful in many situations, especially when ad hoc integrations are needed, such as after a new group or
product is added to the organization.
3. Since it stores data related to a specific part of an organization, data retrieval from it is very
quick.
4. Designing and maintaining a data mart is quite simple compared to a data warehouse.
7. This storage unit is used by most organizations for the smooth running of their departments.
2. Organizations are provided with choices to choose a model of data mart depending upon cost and
their business needs.
1. Since it stores data related only to a specific function, it does not store the huge volume of data
related to each and every department of an organization, unlike a data warehouse.
Subset of Data: Data marts are designed to store a subset of data from a larger data warehouse or data
lake. This allows for faster query performance since the data in the data mart is focused on a
specific business unit or department.
Optimized for Query Performance: Data marts are optimized for query performance, which means that
they are designed to support fast queries and analysis of the data stored in the data mart.
Customizable: Data marts are customizable, which means that they can be designed to meet the specific
needs of a business unit or department.
Self-Contained: Data marts are self-contained, which means that they have their own set of
tables, indexes, and data models. This allows for easier management and maintenance of the data mart.
Security: Data marts can be secured, which means that access to the data in the data mart can
be controlled and restricted to specific users or groups.
Scalability: Data marts can be scaled horizontally or vertically to accommodate larger volumes of data or
to support more users.
Integration with Business Intelligence Tools: Data marts can be integrated with business intelligence
tools, such as Tableau, Power BI, or QlikView, which allows users to analyze and visualize the
data stored in the data mart.
ETL Process: Data marts are typically populated using an Extract, Transform, Load (ETL)
process, which means that data is extracted from the larger data warehouse or data lake, transformed to
meet the requirements of the data mart, and loaded into the data
mart.
Note − Do not create a data mart for any other reason, since the operational cost of data marting could be
very high. Before data marting, make sure that the data marting strategy is appropriate for your particular
solution.
In this step, we determine whether the organization has natural functional splits. We look for
departmental splits, and we determine whether the way in which departments use information tends to be
in isolation from the rest of the organization. Let's look at an example.
Consider a retail organization, where each merchant is accountable for maximizing the sales of a group of
products. For this, the following are the valuable information −
As the merchant is not interested in the products they are not dealing with, the data mart is a subset of
the data dealing only with the product group of interest. The following diagram shows data marting
for different users.
Given below are the issues to be taken into account while determining the functional split −
The merchant could query the sales trend of other products to analyze what is happening to the
sales.
Note − we need to determine the business benefits and technical feasibility of using a data mart.
We need data marts to support user access tools that require internal data structures. The data in
such structures are outside the control of data warehouse but need to be populated and updated on a
regular basis.
There are some tools that populate directly from the source system but some cannot. Therefore additional
requirements outside the scope of the tool are needed to be identified for future.
Note − In order to ensure consistency of data across all access tools, the data should not be
directly
populated from the data warehouse, rather each tool must have its own data mart.
There should to be privacy rules to ensure the data is accessed by authorized users only. For example a
data warehouse for retail banking institution ensures that all the accounts belong to the same legal entity.
Privacy laws can force you to totally prevent access to information that is not owned by the specific bank.
Data marts allow us to build a complete wall by physically separating data segments within the
data warehouse. To avoid possible privacy problems, the detailed data can be removed from the
data warehouse. We can create data mart for each legal entity and load it via data warehouse, with
detailed account data.
3.9 Designing Data Marts
Data marts should be designed as a smaller version of starflake schema within the data warehouse and
should match with the database design of the data warehouse. It helps in maintaining control
over database instances.
The summaries are data marted in the same way as they would have been designed within the
data warehouse. Summary tables help to utilize all dimension data in the starflake schema.
The significant steps in implementing a data mart are to design the schema, construct the
physical storage, populate the data mart with data from source systems, access it to make informed
decisions and manage it over time. So, the steps are:
Designing
The design step is the first in the data mart process. This phase covers all of the functions from initiating
the request for a data mart through gathering data about the requirements and developing the logical and
physical design of the data mart.
Constructing
This step contains creating the physical database and logical structures associated with the data mart to
provide fast and efficient access to the data.
1. Creating the physical database and logical structures such as tablespaces associated with the data
mart.
2. Creating the schema objects such as tables and indexes described in the design step.
Populating
This step includes all of the tasks related to the getting data from the source, cleaning it up, modifying it
to the right format and level of detail, and moving it into the data mart.
2. Extracting data
Accessing
This step involves putting the data to use: querying the data, analyzing it, creating reports, charts
and graphs and publishing them.
1. Set up an intermediate layer (meta layer) for the front-end tool to use. This layer
translates database operations and object names into business terms so that end users
can interact with the data mart using words which relate to the business functions.
2. Set up and manage database structures, such as summarized tables, which help queries submitted
through the front-end tools execute rapidly and efficiently.
Managing
This step contains managing the data mart over its lifetime. In this step, management functions
are performed as:
Although data marts are created on the same hardware, they require some additional hardware
and software. To handle user queries, it requires additional processing power and disk storage. If detailed
data and the data mart exist within the data warehouse, then we would face additional cost to
store and manage replicated data.
Note − Data marting is more expensive than aggregations, therefore it should be used as an
additional strategy and not as an alternative strategy.
Network Access
A data mart could be on a different location from the data warehouse, so we should ensure that the LAN
or WAN has the capacity to handle the data volumes being transferred within the data mart
load process.
Network capacity.
Using data partitioning techniques, a huge dataset can be divided into smaller, simpler sections. A few
applications for these techniques include parallel computing, distributed systems, and
database administration. Data partitioning aims to improve data processing performance,
scalability, and efficiency.
Partitioning is important for the following reasons:
To assist backup/recovery.
To enhance performance.
Note − To cut down on the backup size, all partitions other than the current partition can be marked as
read-only. We can then put these partitions into a state where they cannot be modified. Then they can be
backed up. It means only the current partition is to be backed up.
To Enhance Performance
By partitioning the fact table into sets of data, the query procedures can be enhanced. Query performance
is enhanced because now the query scans only those partitions that are relevant. It does not have to scan
the whole data.
1. Horizontal Partitioning
2. Vertical Partitioning
3. Key-based Partitioning
4. Range-based Partitioning
5. Hash-based Partitioning
6. Round-robin Partitioning
1. Horizontal Partitioning/Sharding
In this technique, the dataset is divided based on rows or records. Each partition contains a
subset of rows, and the partitions are typically distributed across multiple servers or storage devices.
Horizontal partitioning is often used in distributed databases or systems to improve parallelism
and enable load balancing.
There are various ways in which a fact table can be partitioned. In horizontal partitioning, we have
to keep in mind the requirements for manageability of the data warehouse.
This kind of partitioning is done where aged data is accessed infrequently. It is implemented as a set of
small partitions for relatively current data and a larger partition for inactive data.
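A hedged sketch of that scheme: fact rows are split into small monthly partitions for recent data and a single larger partition for aged data. The cut-off date and row values are invented.

from datetime import date

# Fact rows: (transaction_date, amount).
rows = [(date(2024, 11, 3), 120.0), (date(2024, 12, 9), 75.0),
        (date(2023, 5, 21), 40.0),  (date(2022, 8, 2), 310.0)]

cutoff = date(2024, 1, 1)          # anything older goes to the aged partition
partitions = {"aged": []}

for txn_date, amount in rows:
    if txn_date < cutoff:
        partitions["aged"].append((txn_date, amount))
    else:
        # One small partition per recent month, e.g. "2024-11".
        key = f"{txn_date.year}-{txn_date.month:02d}"
        partitions.setdefault(key, []).append((txn_date, amount))

for name, part in partitions.items():
    print(name, len(part))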
Points to Note
The detailed information remains available online.
The number of physical tables is kept relatively small, which reduces the operating cost.
This technique is suitable where a mix of data dipping into recent history and data mining
through the entire history is required.
This technique is not useful where the partitioning profile changes on a regular basis,
because repartitioning will increase the operating cost of the data warehouse.
The fact table can also be partitioned on the basis of dimensions other than time such as product group,
region, supplier, or any other dimension. Let's have an example.
Suppose a market function has been structured into distinct regional departments like on a state
by state basis. If each region wants to query on information captured within its region, it would prove to
be more effective to partition the fact table into regional partitions. This will cause the queries to speed up
because it does not require to scan information that is not relevant.
Points to Note
The query does not have to scan irrelevant data which speeds up the query process.
This technique is not appropriate where the dimension is likely to change in the future, so it is
worth confirming that the dimension will not change before partitioning on it.
If the dimension changes, the entire fact table would have to be repartitioned.
Note − We recommend performing the partition only on the basis of the time dimension, unless you
are certain that the suggested dimension grouping will not change within the life of the data warehouse.
When there is no clear basis for partitioning the fact table on any dimension, we should partition
the fact table on the basis of its size. We can set a predetermined size as a critical point; when a
table exceeds the predetermined size, a new table partition is created.
Points to Note
Partitioning Dimensions
If a dimension contains large number of entries, then it is required to partition the dimensions. Here we
have to check the size of a dimension.
Consider a large design that changes over time. If we need to store all the variations in order to apply
comparisons, that dimension may be very large. This would definitely affect the response time.
In the round-robin technique, when a new partition is needed, the oldest one is archived. Metadata is
used to allow user access tools to refer to the correct table partition.
This technique makes it easy to automate table management facilities within the data warehouse.
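A minimal sketch of that rotation, assuming a fixed number of active partitions; when a new partition is needed, the oldest active partition is moved to an archive. The names and the limit are invented.

from collections import deque

# Keep at most N active partitions; when a new one is needed, the oldest is
# archived (a rolling window over time periods).
MAX_ACTIVE = 3
active = deque()            # newest partition at the right
archived = []

def new_partition(name):
    if len(active) == MAX_ACTIVE:
        archived.append(active.popleft())   # archive the oldest partition
    active.append(name)

for month in ["2024-10", "2024-11", "2024-12", "2025-01"]:
    new_partition(month)

print("active:", list(active))     # ['2024-11', '2024-12', '2025-01']
print("archived:", archived)       # ['2024-10']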
Advantages:
2. Load balancing: By partitioning data, the workload can be distributed equally among
several nodes, avoiding bottlenecks and enhancing system performance.
3. Data separation: Since each partition can be managed independently, data isolation and
fault tolerance are improved. The other partitions can carry on operating even if one fails.
Disadvantages:
1. Join operations: Horizontal partitioning can make join operations across multiple partitions more
complex and potentially slower, as data needs to be fetched from different nodes.
2. Data skew: If the distribution of data is uneven or if some partitions receive more queries
or updates than others, it can result in data skew, impacting performance and load balancing.
2. Vertical Partitioning
Unlike horizontal partitioning, vertical partitioning divides the dataset based on columns or attributes. In
this technique, each partition contains a subset of columns for each row. Vertical partitioning is useful
when different columns have varying access patterns or when some columns are more frequently
accessed than others.
Vertical partitioning splits the data vertically. The following image depicts how vertical partitioning is
done.
Vertical partitioning can be performed in the following two ways −
Normalization
Row Splitting
1. Normalization
Normalization is the standard relational method of database organization. In this method, duplicate rows
are collapsed into a single row, which reduces space. Take a look at the following tables, which
show how normalization is performed.
Store_id   Store_name   Location    Region
16         sunny        Bangalore   W
64         san          Mumbai      S

Product_id   Quantity   Value   Sales_date   Store_id
30           5          3.67    3-Aug-13     16
35           4          5.33    3-Sep-13     16
40           5          2.50    3-Sep-13     64
45           7          5.66    3-Sep-13     16
Eliminate duplicates for efficiency: Much of the work that normalization accomplishes is related
to eliminating duplicates, which frees space and allows your systems to run more efficiently.
Ensure data is logically organized: Normalization applies a set of rules to associate attributes with
the data, allowing it to be organized by attribute. Because of this, you can organize data
more effectively.
Query data from multiple sources easily: When data is standardized, you can combine data sets
and run queries on the larger set. Many analyses require looking at multiple data sets
from multiple sources.
More easily update data: When data has been normalized, you can more easily update it because
you don’t have to be concerned about updating duplicates.
Facilitate better data governance: Better governance is one of the most valuable benefits of data
normalization. The normalization process is an essential part of creating high-quality data that’s
been vetted and is ready for use. And because normalization also allows you to
effectively organize data, you can track it more easily.
Advantages of Normalization:
Here we can see why normalization is an attractive prospect in RDBMS design:
A smaller database can be maintained, as normalization eliminates duplicate data; the overall size
of the database is reduced as a result.
Better performance is ensured, which is connected to the above point: as databases become
smaller in size, passes through the data become faster and shorter, improving response time
and speed.
Narrower tables are possible, as normalized tables are fine-tuned and have fewer columns,
which allows more data records per page.
Fewer indexes per table ensure faster maintenance tasks (index rebuilds).
It also realizes the option of joining only the tables that are needed.
Reduces data redundancy and inconsistency: Normalization eliminates data redundancy and
ensures that each piece of data is stored in only one place, reducing the risk of data inconsistency
and making it easier to maintain data accuracy.
Improves data integrity: By breaking down data into smaller, more specific tables, normalization
helps ensure that each table stores only relevant data, which improves the overall data integrity of
the database.
Facilitates data updates: Normalization simplifies the process of updating data, as it only needs to
be changed in one place rather than in multiple places throughout the database.
Simplifies database design: Normalization provides a systematic approach to database design that
can simplify the process and make it easier to develop and maintain the database over time.
Supports flexible queries: Normalization enables users to query the database using a variety
of different criteria, as the data is organized into smaller, more specific tables that can be
joined together as needed.
Helps ensure database scalability: Normalization helps ensure that the database can scale to meet
future needs by reducing data redundancy and ensuring that the data is organized in a way that
supports future growth and development.
Supports data consistency across applications: Normalization can help ensure that data is
consistent across different applications that use the same database, making it easier to
integrate different applications and ensuring that all users have access to accurate and consistent
data.
Disadvantages of Normalization:
More tables to join: by spreading data out into more tables, the need to join tables increases
and the task becomes more tedious. The database also becomes harder to comprehend.
Tables will contain codes rather than real data, as repeated data will be stored as codes
instead of the actual values. Hence, there is always a need to refer to the lookup table.
The data model becomes extremely hard to query against, as the data model is optimized for
applications, not for ad hoc querying. (An ad hoc query is a query that cannot be determined
before the issuance of the query. It consists of SQL that is constructed dynamically and is
typically built by desktop-friendly query tools.) Consequently, it is difficult to model the
database without understanding what the client wants.
As the normal form progresses to higher types, the performance becomes slower and slower.
Increased complexity: Normalization can increase the complexity of a database design, especially
if the data model is not well understood or if the normalization process is not carried out correctly.
This can lead to difficulty in maintaining and updating the database over time.
Reduced flexibility: Normalization can limit the flexibility of a database, as it requires data to be
organized in a specific way. This can make it difficult to accommodate changes in the data or to
create new reports or applications that require different data structures.
Loss of data context: Normalization can result in the loss of data context, as data may be split
across multiple tables and require additional joins to retrieve. This can make it harder to
understand the relationships between different pieces of data.
Potential for data update anomalies: Normalization can introduce the potential for data
update anomalies, such as insert, update, and delete anomalies, if the database is not properly
designed and maintained.
Need for expert knowledge: Proper implementation of normalization requires expert knowledge
of database design and the normalization process. Without this knowledge, the database may not
be optimized for performance, and data consistency may be compromised.
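The flip side, as noted in the join-related drawbacks above, is that reading a complete record now requires joining back to the lookup table. A hypothetical continuation of the earlier sketch:

# After normalization, rebuilding a full order record needs a join on cust_id.
customers = {"C1": {"cust_city": "Pune"}}
orders = [{"order_id": 1, "cust_id": "C1", "amount": 250}]

full_rows = [{**o, **customers[o["cust_id"]]} for o in orders]
print(full_rows)  # each row is reassembled through the extra join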
2. Row Splitting
Row splitting tends to leave a one-to-one map between partitions. The motive of row splitting is to speed
up access to a large table by reducing its size.
Note − While using vertical partitioning, make sure that there is no requirement to perform a major join
operation between two partitions.
It is crucial to choose the right partition key; choosing a wrong partition key can force the fact table to be
reorganized. As an example, suppose we want to partition the following table.
Account_Txn_Table
transaction_id
account_id
transaction_type
value
transaction_date
region
branch_name
We can choose to partition on any key. The two possible keys could be
region
transaction_date
Suppose the business is organized in 30 geographical regions and each region has a different number of
branches. Partitioning by region will give us 30 partitions, which is reasonable. This choice is good enough
because our requirements capture has shown that the vast majority of queries are restricted to the user's
own business region.
If we partition by transaction_date instead of region, then the latest transaction from every region will be
in one partition. Now the user who wants to look at data within his own region has to query
across multiple partitions.
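The effect of the partition key on this query pattern can be sketched in Python (the rows, regions, and dates are hypothetical, and the partitions are plain in-memory lists):

from collections import defaultdict

# A few hypothetical rows from Account_Txn_Table.
txns = [
    {"transaction_id": 1, "region": "North", "transaction_date": "2024-01-05", "value": 100},
    {"transaction_id": 2, "region": "South", "transaction_date": "2024-01-05", "value": 250},
    {"transaction_id": 3, "region": "North", "transaction_date": "2024-02-11", "value": 75},
]

def partition_by(rows, key):
    parts = defaultdict(list)
    for row in rows:
        parts[row[key]].append(row)
    return parts

by_region = partition_by(txns, "region")
by_date = partition_by(txns, "transaction_date")

# Query: all transactions in the user's own region.
north_only = by_region["North"]                        # touches exactly one partition
north_scan = [r for p in by_date.values() for r in p   # must scan every date partition
              if r["region"] == "North"]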
2. Efficient data retrieval: When a query only requires a subset of columns, vertical
partitioning allows retrieving only the necessary data, saving storage and I/O resources.
3. Simplified schema management: With vertical partitioning, adding or removing columns becomes
easier, as the changes only affect the respective partitions.
Disadvantages:
1. Increased complexity: Vertical partitioning can lead to more complex query execution plans, as
queries may need to access multiple partitions to gather all the required data.
2. Joins across partitions: Joining data from different partitions can be more complex and potentially
slower, as it involves retrieving data from different partitions and combining them.
3. Limited scalability: Vertical partitioning may not be as effective for datasets that
continuously grow in terms of the number of columns, as adding new columns may require
restructuring the partitions.
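A minimal Python sketch of the vertical split described above, assuming a hypothetical customer table whose wide, rarely used column is moved to its own partition while both partitions keep the primary key:

# Hypothetical wide customer table split vertically by columns.
customer_rows = [
    {"cust_id": "C1", "name": "Asha", "city": "Pune",  "profile_blob": "...large text..."},
    {"cust_id": "C2", "name": "Ravi", "city": "Delhi", "profile_blob": "...large text..."},
]

# Partition 1 holds the frequently read columns; partition 2 holds the wide
# column; both retain cust_id so the pieces can be rejoined when needed.
hot_part  = [{"cust_id": r["cust_id"], "name": r["name"], "city": r["city"]} for r in customer_rows]
cold_part = {r["cust_id"]: r["profile_blob"] for r in customer_rows}

# A query needing only name/city reads hot_part alone.
names = [(r["cust_id"], r["name"]) for r in hot_part]
profile_of_c1 = cold_part["C1"]  # the wide column is fetched only on demand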
3. Key-based Partitioning
Using this method, the data is divided based on a particular key or attribute value. The dataset is
partitioned so that each partition contains all the data related to a specific key value. Key-based partitioning
is commonly used in distributed databases or systems to distribute the data evenly and allow efficient data
retrieval based on key lookups.
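A small Python sketch of the idea, assuming a hypothetical account_id key and three notional nodes; the only property that matters is that the same key always maps to the same partition:

# Rows are routed by account_id, so all rows for one account live on one node.
NODES = ["node-0", "node-1", "node-2"]

def node_for(account_id: str) -> str:
    # any deterministic mapping works; the same key always yields the same node
    return NODES[sum(account_id.encode()) % len(NODES)]

store = {n: [] for n in NODES}
for rec in [{"account_id": "A17", "value": 100},
            {"account_id": "A17", "value": 40},
            {"account_id": "B02", "value": 90}]:
    store[node_for(rec["account_id"])].append(rec)

a17_rows = store[node_for("A17")]  # a key lookup touches exactly one node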
Advantages:
1. Even data distribution: Key-based partitioning ensures that data with the same key value is stored
in the same partition, enabling efficient data retrieval by key lookups.
2. Scalability: Key-based partitioning can distribute data evenly across partitions, allowing for better
parallelism and improved scalability.
3. Load balancing: By distributing data based on key values, the workload is balanced across
multiple partitions, preventing hotspots and optimizing performance.
Disadvantages:
1. Skew and hotspots: If the key distribution is uneven or if certain key values are more frequently
accessed than others, it can lead to data skew or hotspots, impacting performance and
load balancing.
2. Limited query flexibility: Key-based partitioning is most efficient for queries that primarily
involve key lookups. Queries that span multiple keys or require range queries may suffer
from increased complexity and potentially slower performance.
3. Partition management: Managing partitions based on key values requires careful planning
and maintenance, especially when the dataset grows or the key distribution changes.
4. Range Partitioning
Range partitioning divides the dataset according to predetermined ranges of values. For instance, if your
dataset contains timestamps, you can divide the data based on particular time ranges. Range partitioning is
helpful when the data has a natural ordering and you want to distribute it evenly based on ranges of values.
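A minimal Python sketch of range partitioning on a date column, using hypothetical monthly boundaries; each boundary is an exclusive upper bound for the partition before it:

import bisect
from datetime import date

# Partition 0: before 1 Feb, partition 1: February, partition 2: March, partition 3: the rest.
boundaries = [date(2024, 2, 1), date(2024, 3, 1), date(2024, 4, 1)]

def range_partition(txn_date: date) -> int:
    # bisect finds how many boundaries the date has passed -> partition index
    return bisect.bisect_right(boundaries, txn_date)

print(range_partition(date(2024, 1, 15)))  # 0 -> January partition
print(range_partition(date(2024, 3, 9)))   # 2 -> March partition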
Advantages:
1. Natural ordering: Range partitioning is suitable for datasets with a natural ordering based
on a specific attribute. It allows for efficient data retrieval based on ranges of values.
2. Even data distribution: By dividing the dataset based on ranges, range partitioning can distribute
the data evenly across partitions, ensuring load balancing and optimal performance.
3. Simplified query planning: Range partitioning simplifies query planning when queries primarily
involve range-based conditions, as the system knows which partition(s) to access based on
the range specified.
Disadvantages:
1. Uneven data distribution: If the data distribution is not evenly distributed across ranges, it
can lead to data skew and impact load balancing and query performance.
2. Data growth challenges: As the dataset grows, the ranges may need to be adjusted or new
partitions added, requiring careful management and potentially affecting existing queries and data
distribution.
3. Joins and range queries: Range partitioning can introduce complexity when performing
joins across partitions or when queries involve multiple non-contiguous ranges, potentially
leading to performance challenges.
5. Hash-based Partitioning
Hash partitioning applies a hash function to the data to decide which partition it belongs to. A key from
each record is fed into the hash function, which produces a hash value used to assign the record to a
particular partition. By spreading data effectively at random among partitions, hash-based partitioning can
help with load balancing and quick data retrieval.
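A small Python sketch of hash-based partition assignment (MD5 and eight partitions are arbitrary choices); the point is that the hash value alone decides the partition and the resulting spread is roughly uniform:

import hashlib
from collections import Counter

NUM_PARTITIONS = 8

def hash_partition(key: str) -> int:
    # the hash of the key decides which partition the row belongs to
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

# Over many synthetic keys the partition sizes come out roughly equal.
counts = Counter(hash_partition(f"txn-{i}") for i in range(10_000))
print(sorted(counts.values()))  # sizes cluster around 10_000 / 8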
Advantages:
3. Simplicity: Hash-based partitioning does not depend on any particular data properties or
ordering, and it is relatively easy to implement.
Disadvantages:
1. Key-based queries: Hash-based partitioning is not suitable for efficient key-based lookups, as the
data is distributed randomly across partitions. Key-based queries may require searching
across multiple partitions.
2. Load balancing challenges: In some cases, the distribution of data may not be perfectly balanced,
resulting in load imbalances and potential performance issues.
6. Round-robin Partitioning
In round-robin partitioning, data is evenly distributed across partitions in a cyclic manner. Each partition
is assigned the next available data item sequentially, regardless of the data’s characteristics. Round-robin
partitioning is straightforward to implement and can provide a basic level of load balancing.
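A minimal Python sketch of round-robin assignment with three hypothetical partitions; rows are placed strictly in turn, so ten rows end up split 4/3/3:

import itertools

NUM_PARTITIONS = 3
partitions = [[] for _ in range(NUM_PARTITIONS)]
cycle = itertools.cycle(range(NUM_PARTITIONS))

# Each incoming row goes to the next partition in turn, regardless of content.
for row_id in range(10):
    partitions[next(cycle)].append({"row_id": row_id})

print([len(p) for p in partitions])  # [4, 3, 3] -> nearly equal sizes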
Advantages:
2. Basic load balancing: Round-robin partitioning can provide a basic level of load balancing,
ensuring that data is distributed across partitions evenly.
3. Scalability: Round-robin partitioning enables scalability by dividing the data into several
partitions and permitting parallel processing.
Disadvantages:
1. Unequal partition sizes: if the total number of data items is not an exact multiple of the number of
partitions, round-robin partitioning produces slightly unequal partition sizes, and because items are
assigned blindly, the data volume per partition can also become uneven.
2. Inefficient data retrieval: Round-robin partitioning does not consider any data characteristics
or access patterns, which may result in inefficient data retrieval for certain queries.
3. Limited query optimization: Round-robin partitioning does not optimize for specific query
patterns or access patterns, potentially leading to suboptimal query performance.
Partitioning Technique | Description | Suitable Data | Query Performance | Data Distribution | Complexity
Horizontal Partitioning | Divides the dataset based on rows/records | Large datasets | Complex joins | Uneven distribution | Distributed transaction management
Key-based Partitioning | Divides the dataset based on a specific key | Key-value datasets | Efficient key lookups | Even distribution by key | Limited query flexibility
Aspect | Vertical Partitioning | Horizontal Partitioning
Definition | Dividing a table into smaller tables based on columns. | Dividing a table into smaller tables based on rows (usually ranges of rows).
Data distribution | Columns with related data are placed together in the same table. | Rows with related data (typically based on a range or a condition) are placed together in the same table.
Maintenance and indexing | Easier to manage and index specific columns based on their characteristics and access patterns. | Each partition can be indexed independently, making indexing more efficient.
Joins | May require joins to combine data from multiple partitions when querying. | Joins between partitions are typically not needed, as they contain disjoint sets of data.
Use cases | Commonly used for tables with a wide range of columns, where not all columns are frequently accessed together. | Commonly used for tables with a large number of rows, where data can be grouped based on some criteria (e.g., date ranges).
Data Warehouse | Data Mart
A Data Warehouse is a vast repository of information collected from various organizations or departments within a corporation. | A data mart is only a subtype of a Data Warehouse, architected to meet the requirements of a specific user group.
It may hold multiple subject areas. | It holds only one subject area, for example Finance or Sales.
It works to integrate all data sources. | It concentrates on integrating data from a given subject area or set of source systems.
In a data warehouse, the fact constellation schema is used. | In a data mart, the star schema and snowflake schema are used.