Chapter 2 Data Warehousing

Data Warehousing

Prof Sadanand S Borse


Data Warehouse
 A data warehouse can be defined as a collection of organizational data and
information extracted from operational sources and external data sources.

 The data is periodically pulled from various internal applications like sales,
marketing, and finance; customer-interface applications; as well as external
partner systems.

 This data is then made available for decision-makers to access and analyze.

 Data warehouses are exclusively intended to perform queries and analysis and
often contain large amounts of historical data. The data within a data
warehouse is usually derived from a wide range of sources such as application
log files and transaction applications.
Key Characteristics of Data
Warehouse
 Subject-Oriented
 Integrated
 Non-Volatile
 Time-Variant
A typical data warehouse often includes
the following elements:
• A relational database to store and manage data
• An extraction, loading, and transformation (ELT)
solution for preparing the data for analysis
• Statistical analysis, reporting, and data mining
capabilities
• Client analysis tools for visualizing and presenting data
to business users
• Other, more sophisticated analytical applications that
generate actionable information by applying
data science and artificial intelligence (AI) algorithms,
or graph and spatial features that enable more kinds of
analysis at scale
What is ETL?
 ETL stands for extract, transform, and load: a data integration process
that combines data from multiple data sources into a single, consistent data store
that is loaded into a data warehouse or other target system.
 It is the process of collecting data from multiple sources and transforming it into a
usable format for analysis.
• Extract data from legacy systems
• Cleanse the data to improve data quality and establish consistency
• Load data into a target database
 A related variant is ELT (extract, load, transform), in which raw data is loaded
into the target system first and transformed there.
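The three steps above can be sketched with Python's standard sqlite3 module. The table name, fields, and sample rows are illustrative, not part of any particular warehouse:

```python
import sqlite3

# Extract: rows as they might arrive from a legacy system (illustrative data)
raw_rows = [
    {"id": "1", "name": "  Alice ", "amount": "120.50"},
    {"id": "2", "name": "BOB",      "amount": "80"},
    {"id": "2", "name": "BOB",      "amount": "80"},   # duplicate to be cleansed
]

# Transform: cleanse to improve data quality and establish consistency
seen, clean_rows = set(), []
for r in raw_rows:
    if r["id"] in seen:
        continue                      # drop duplicate records
    seen.add(r["id"])
    clean_rows.append((int(r["id"]), r["name"].strip().title(), float(r["amount"])))

# Load: write the consistent rows into the target warehouse table
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?)", clean_rows)
con.commit()

print(con.execute("SELECT customer, amount FROM sales ORDER BY id").fetchall())
# [('Alice', 120.5), ('Bob', 80.0)]
```

In an ELT pipeline, the raw rows would be loaded first and the cleansing would run as SQL inside the target system instead.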
Database vs Data Warehouse
 A data warehouse and a traditional database share some
similarities, but they serve different purposes.
 The main difference is that in a database, data is collected for
multiple transactional purposes.
 In a data warehouse, data is collected on an extensive scale to
perform analytics.
 Databases provide real-time data, while warehouses store data to
be accessed for big analytical queries.
 A data warehouse is an example of an OLAP (online analytical
processing) system, i.e., an online database query-answering system.
OLTP (online transaction processing) is an online database-modifying
system; an ATM is a typical example.
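The OLTP/OLAP contrast can be sketched with sqlite3; the account schema and figures are hypothetical:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE withdrawals (account TEXT, amount REAL, ts TEXT)")

# OLTP: many small, real-time modifications (e.g. each ATM withdrawal)
con.execute("INSERT INTO withdrawals VALUES ('A-1', 40.0, '2024-01-05')")
con.execute("INSERT INTO withdrawals VALUES ('A-1', 60.0, '2024-02-10')")
con.execute("INSERT INTO withdrawals VALUES ('B-2', 25.0, '2024-02-11')")
con.commit()

# OLAP: one large analytical query over the accumulated history
totals = con.execute(
    "SELECT account, SUM(amount) FROM withdrawals GROUP BY account ORDER BY account"
).fetchall()
print(totals)   # [('A-1', 100.0), ('B-2', 25.0)]
```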
Data Warehouse Architecture

 Simple.
 Simple with a staging area.
 Hub and spoke.
 Sandboxes
Data Warehouse Architecture

 The data warehouse architecture comprises a three-tier
structure:
 Bottom Tier
 Middle Tier
 Top Tier
How Data Warehouse Works

 Data warehousing integrates data and information collected from
various sources into one comprehensive database.
 Data mining is one of the features of a data warehouse that involves
looking for meaningful data patterns in vast volumes of data and
devising innovative strategies for increased sales and profits.
 Data Warehouse works as a central repository where information arrives
from one or more data sources. Data flows into a data warehouse from
the transactional system and other relational databases.
 Data may be:
1. Structured
2. Semi-structured
3. Unstructured
 The data is processed, transformed, and ingested so that users can access the processed data in the Data
Warehouse through Business Intelligence tools, SQL clients, and spreadsheets.
 By merging all this information in one place, an organization can analyze its customers more holistically,
helping to ensure that it has considered all the information available.
Types of Data Warehouse

 Offline Operational Database
 Offline Data Warehouse
 Real-time Data Warehouse
 Integrated Data Warehouse
Four components of Data
Warehouses are:
 Load manager
 Warehouse manager
 Query manager
 End-user access tools
Data lakes
 A data lake is a centralized repository that allows you to store all your
structured and unstructured data at any scale.
 It is a place to store every type of data in its native format, with no fixed
limits on account size or file size.
 It handles large data quantities to improve analytic performance and offers
native integration.
 Data Lake is like a large container that is very similar to a real lake and
river.
 Just like a lake has multiple tributaries coming in, a data lake
has structured data, unstructured data, machine-to-machine data, and logs
flowing through in real time.
 A data lake can include structured data from relational
databases (rows and columns), semi-structured data
(CSV, logs, XML, JSON), unstructured data (emails,
documents, PDFs) and binary data (images, audio,
video).
 A data lake can be deployed "on premises" (within
an organization's data centers) or "in the cloud" (using
cloud services from vendors such as Amazon, Microsoft,
or Google).
 The Data Lake democratizes data and is a cost-effective
way to store all data of an organization for later
processing. Research Analysts can focus on finding
meaningful patterns in data and not data itself.
 Unlike a hierarchical data warehouse where data is
stored in files and folders, a data lake has a flat
architecture. Every data element in a data lake is given
a unique identifier and tagged with a set of metadata.
 The main objective of building a data lake is to offer an
unrefined view of data to data scientists.
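The flat, identifier-plus-metadata layout described above can be sketched in a few lines of Python; the store, the ingest helper, and the metadata tags are all illustrative assumptions, not a real data lake API:

```python
import uuid

# A data lake as a flat store: every element gets a unique identifier
# and a set of metadata tags (names and tags here are illustrative)
lake = {}

def ingest(payload, **metadata):
    element_id = str(uuid.uuid4())          # unique identifier per element
    lake[element_id] = {"payload": payload, "metadata": metadata}
    return element_id

ingest(b"<xml>...</xml>", source="crm", format="xml", kind="semi-structured")
ingest(b"\x89PNG...", source="mobile-app", format="png", kind="binary")
log_id = ingest("GET /index 200", source="web", format="log", kind="unstructured")

# Schema-free retrieval: filter by metadata tags, not by folder hierarchy
logs = [eid for eid, e in lake.items() if e["metadata"]["format"] == "log"]
print(logs == [log_id])   # True
```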
Reasons for using Data Lake
 There is no need to model data into an enterprise-wide
schema with a Data Lake.
 With the increase in data volume, data quality, and
metadata, the quality of analyses also increases.
 Data Lake offers business Agility
 Machine Learning and Artificial Intelligence can be used to
make profitable predictions.
 It offers a competitive advantage to the implementing
organization.
 There is no data silo structure. A data lake gives a 360-degree
view of customers and makes analysis more robust.
Data Warehouse vs Data Lake

 Data
• Data Warehouse: Relational data from transactional systems, operational databases, and line-of-business applications
• Data Lake: Non-relational and relational data from IoT devices, web sites, mobile apps, social media, and corporate applications
 Schema
• Data Warehouse: Designed prior to the DW implementation (schema-on-write)
• Data Lake: Written at the time of analysis (schema-on-read)
 Price/Performance
• Data Warehouse: Fastest query results using higher-cost storage
• Data Lake: Query results getting faster using low-cost storage
 Data Quality
• Data Warehouse: Highly curated data that serves as the central version of the truth
• Data Lake: Any data, which may or may not be curated (i.e., raw data)
 Users
• Data Warehouse: Business analysts
• Data Lake: Data scientists, data developers, and business analysts (using curated data)
 Analytics
• Data Warehouse: Batch reporting, BI, and visualizations
• Data Lake: Machine learning, predictive analytics, data discovery and profiling
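The schema-on-write vs schema-on-read row above can be made concrete with a small sketch; the schema, records, and field names are hypothetical:

```python
import json

# Schema-on-write (warehouse): validate and shape rows before they are loaded
WAREHOUSE_SCHEMA = ("customer", "amount")   # illustrative fixed schema

def load_into_warehouse(record):
    if set(record) != set(WAREHOUSE_SCHEMA):
        raise ValueError("record does not match warehouse schema")
    return tuple(record[col] for col in WAREHOUSE_SCHEMA)

warehouse = [load_into_warehouse({"customer": "Alice", "amount": 120.5})]

# Schema-on-read (lake): store raw, apply structure only at analysis time
lake = ['{"customer": "Bob", "amount": 80, "channel": "web"}',
        '{"customer": "Cara", "amount": 55}']           # raw JSON strings
amounts = [json.loads(raw)["amount"] for raw in lake]   # schema applied at read
print(warehouse, amounts)   # [('Alice', 120.5)] [80, 55]
```

Note how the lake accepts records with extra or missing fields; the warehouse rejects them at load time.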
1. Ingestion Tier: The tiers on the left depict the data sources. Data is
loaded into the data lake in batches or in real time.
2. Insights Tier: The tiers on the right represent the research side,
where insights from the system are used. SQL, NoSQL queries, or
even Excel can be used for data analysis.
3. HDFS is a cost-effective solution for both structured and
unstructured data. It is a landing zone for all data that is at rest in
the system.
4. Distillation Tier takes data from the storage tier and converts it to
structured data for easier analysis.
5. Processing Tier runs analytical algorithms and user queries in
varying real-time, interactive, and batch modes to generate structured
data for easier analysis.
6. Unified Operations Tier governs system management and
monitoring. It includes auditing and proficiency management, data
management, and workflow management.
Key Concepts: Data Mining
 Data mining is the process of discovering actionable
information from large sets of data. Data mining uses
mathematical analysis to derive patterns and trends
that exist in data.
 Typically, these patterns cannot be discovered by
traditional data exploration because the relationships
are too complex or because there is too much data.
 Data mining is the process of finding anomalies,
patterns, and correlations within large data sets to
predict outcomes.
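A minimal sketch of finding an anomaly in a data set, using a simple z-score test from the standard library; the sales figures and the 2-standard-deviation threshold are illustrative assumptions, not a production mining technique:

```python
import statistics

# Toy "large data set": daily sales with one anomalous day (illustrative numbers)
daily_sales = [100, 98, 102, 97, 101, 103, 99, 250, 100, 96]

mean = statistics.mean(daily_sales)
stdev = statistics.stdev(daily_sales)

# Flag anomalies: points more than 2 standard deviations from the mean
anomalies = [x for x in daily_sales if abs(x - mean) / stdev > 2]
print(anomalies)   # [250]
```

Real data mining systems apply far richer mathematical analysis, but the principle is the same: derive a statistical model of the data, then look for points and patterns it does not explain.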
Data mining as a step in the process of
knowledge discovery
The architecture of a typical data
mining system has the following
major components
 Database, data warehouse, World Wide Web, or another
information repository
 Database or data warehouse server
 Knowledge base
 Data mining engine
 Pattern evaluation module
 User interface
 The data mining process is the discovery, through large data sets, of
patterns, relationships, and insights that let enterprises measure and
manage where they are and predict where they will be in the future.
Large amounts of data can come from various data sources and may be
stored in different data warehouses, and data mining techniques such as
machine learning, artificial intelligence (AI), and predictive modeling
may be involved.
The data mining process requires commitment and business intelligence
tools. But experts agree that, across all industries, the data mining
process is the same and should follow a prescribed path.
Six steps of Data Mining
1. Business understanding

• First, it is required to understand the business objectives clearly and
find out what the business's needs are.
• Next, assess the current situation by finding the resources,
assumptions, constraints, and other important factors which
should be considered.
• Then, from the business objectives and current situations,
create data mining goals to achieve the business objectives
within the current situation.
• Finally, a good data mining plan has to be established to
achieve both business and data mining goals. The plan should
be as detailed as possible.
2. Data understanding
• The data understanding phase starts with initial data collection, which
is collected from available data sources, to help get familiar with the
data. Some important activities must be performed including data load
and data integration in order to make the data collection successful.
• Next, the “gross” or “surface” properties of acquired data need to be
examined carefully and reported.
• Then, the data needs to be explored by tackling the data mining
questions, which can be addressed using querying, reporting, and
visualization.
• Finally, the data quality must be examined by answering some
important questions, such as "Is the acquired data complete?" and "Are
there any missing values in the acquired data?"
3. Data preparation

 Data preparation typically consumes about 90% of
the time of the project.
 The outcome of the data preparation phase is the final
data set. Once available data sources are identified,
they need to be selected, cleaned, constructed, and
formatted into the desired form.
 The data exploration task at a greater depth may be
carried out during this phase to notice the patterns
based on business understanding.
4. Modeling

• First, modeling techniques have to be selected for the
prepared data set.
• Next, the test scenario must be generated to validate the
quality and validity of the model.
• Then, one or more models are created on the prepared
data set.
• Finally, models need to be assessed carefully, involving
stakeholders, to make sure that the created models meet the
business initiatives.
5. Evaluation

 In the evaluation phase, the model results must be
evaluated in the context of the business objectives set in
the first phase.
 In this phase, new business requirements may be
raised due to the new patterns that have been
discovered in the model results or from other factors.
 Gaining business understanding is an iterative process
in data mining. The go or no-go decision must be made
in this step to move to the deployment phase.
6. Deployment
 The knowledge or information, which is gained through the data
mining process, needs to be presented in such a way that
stakeholders can use it when they want it.
 Based on the business requirements, the deployment phase could be
as simple as creating a report or as complex as a repeatable data
mining process across the organization.
 In the deployment phase, the plans for deployment, maintenance,
and monitoring have to be created for implementation and also future
support.
 From the project point of view, the final report of the project needs to
summarize the project experiences and review the project to see what
needs to be improved, capturing the lessons learned.
OLAP

 OLAP (Online Analytical Processing) is a category of database
processing that facilitates business intelligence.
 OLAP (Online Analytical Processing) is the technology behind
many Business Intelligence (BI) applications.
 OLAP is a powerful technology for data discovery, including
capabilities for report viewing, complex analytical calculations,
and predictive “what if” scenario (budget, forecast) planning.
 In a data warehouse, data sets are stored in tables, each of which
can organize data into just two dimensions at a time.
 OLAP extracts data from multiple relational data sets and
reorganizes it into a multidimensional format that enables very
fast processing and very insightful analysis.
 OLAP tools do not store individual transaction records in two-
dimensional, row-by-column format, like a worksheet, but
instead, use multidimensional database structures—known
as Cubes in OLAP terminology—to store arrays of consolidated
information.
 The data and formulas are stored in an optimized
multidimensional database, while views of the data are created
on-demand.
What is an OLAP cube?
 OLAP databases are divided into one or more
cubes. The cubes are designed in such a way
that creating and viewing reports become
easy.
 The OLAP cube is an array-based
multidimensional database that makes it
possible to process and analyze multiple data
dimensions much more quickly and efficiently
than a traditional relational database.
 In theory, a cube can contain
an infinite number of layers.
 Smaller cubes can exist
within layers—for example,
each store layer could contain
cubes arranging sales by
salesperson and product.
 In practice, data analysts will
create OLAP cubes containing
just the layers they need, for
optimal analysis and
performance.
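A cube as described above can be sketched as a multidimensional array keyed by dimension members; the dimensions (product, city, quarter) and the consolidated sales figures are illustrative:

```python
# A tiny 3-dimensional OLAP cube: (product, city, quarter) -> consolidated sales.
# Dimension members and figures are illustrative.
cube = {
    ("Laptop", "Pune",   "Q1"): 120, ("Laptop", "Pune",   "Q2"): 150,
    ("Laptop", "Mumbai", "Q1"): 200, ("Laptop", "Mumbai", "Q2"): 180,
    ("Phone",  "Pune",   "Q1"):  90, ("Phone",  "Pune",   "Q2"): 110,
    ("Phone",  "Mumbai", "Q1"): 160, ("Phone",  "Mumbai", "Q2"): 170,
}

# Any cell is a direct, pre-consolidated lookup -- no row-by-row scan needed
print(cube[("Laptop", "Mumbai", "Q1")])   # 200

# A view of the data, created on demand: total sales per product
per_product = {}
for (product, city, quarter), sales in cube.items():
    per_product[product] = per_product.get(product, 0) + sales
print(per_product)   # {'Laptop': 650, 'Phone': 530}
```

Storing consolidated values per cell, rather than individual transaction rows, is what makes cube queries fast; real MOLAP engines add compression and indexing on top of this idea.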
Examples of OLAP Tools

 Dundas BI
 Sisense
 IBM Cognos Analytics
 InetSoft
 SAP Business Intelligence
 Halo
OLAP for Multidimensional Analysis

 To analyze and report on the health of a
business and plan future activity, many
variable groups or parameters must be tracked
on a continuous basis, which is beyond the
scope of any number of linked spreadsheets.
 These variable groups or parameters are called
Dimensions in the On-Line Analytical
Processing (OLAP) environment.
 Analysts can take any view, or Slice, of a Cube to produce a
worksheet-like view of points of interest.
 Instead of working on two dimensions (standard spreadsheet) or
three dimensions (for example, a workbook with tabs of the same
report, by one variable), companies have many dimensions to
track.
 For example, a business that distributes goods from more than a
single facility will have at least the following Dimensions to
consider: Accounts, Locations, Periods, Salespeople, and Products.
 These Dimensions comprise a base for the company’s planning,
analysis, and reporting activities.
 Together they represent the “whole” business picture, providing the
foundation for all business planning, analysis and reporting
activities.
Advantages of OLAP

 OLAP is a platform for all types of business, including
planning, budgeting, reporting, and analysis.
 Information and calculations are consistent in an
OLAP cube. This is a crucial benefit.
 Quickly create and analyze “What if” scenarios
 Easily search OLAP database for broad or specific
terms.
 OLAP provides the building blocks for business
modeling tools, Data mining tools, performance
reporting tools.
Advantages of OLAP

 Allows users to slice and dice cube data by
various dimensions, measures, and filters.
 It is good for analyzing time series.
 Finding some clusters and outliers is easy with OLAP.
 It is a powerful online analytical processing and
visualization system that provides fast response times.
Drill Down
• Drill down: In drill-down
operation, the less detailed data
is converted into highly detailed
data. It can be done by:
• Moving down in the concept
hierarchy
• Adding a new dimension
 In the cube given in the overview
section, the drill-down operation
is performed by moving down in
the concept hierarchy of the
Time dimension (Quarter ->
Month).
Roll UP

 Roll-up is also known as
"consolidation" or "aggregation."
 It can be done by:
• Climbing up in the concept
hierarchy
• Reducing the dimensions
 In the cube given in the overview
section, the roll-up operation is
performed by climbing up in the
concept hierarchy of Location
dimension (City -> Country)
Dice
 Dice: It selects a sub-
cube from the OLAP cube
by selecting two or more
dimensions.
 Unlike slice, in dice you
select two or more
dimensions, which results in
the creation of a sub-cube.
Slice:

 Slice: It selects a single
dimension from the OLAP
cube, which results in the
creation of a new sub-cube.
In the cube given in the
overview section, slice is
performed on the
dimension Time = "Q1".
Pivot

 Pivot: It is also known as
the rotation operation, as it
rotates the current view to
get a new view of the
representation.
 In the sub-cube obtained after
the slice operation, performing
pivot operation gives a new
view of it.
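The five operations above can be sketched together on a tiny (city, quarter) cube; the dimension members, hierarchies, and sales figures are all illustrative:

```python
# A tiny (city, quarter) cube of sales totals (figures illustrative)
cube = {
    ("Pune",   "Q1"): 320, ("Pune",   "Q2"): 260,
    ("Mumbai", "Q1"): 360, ("Mumbai", "Q2"): 350,
}
city_to_country = {"Pune": "India", "Mumbai": "India"}   # Location hierarchy
q1_months = {("Pune",   "Q1"): {"Jan": 100, "Feb": 110, "Mar": 110},
             ("Mumbai", "Q1"): {"Jan": 120, "Feb": 120, "Mar": 120}}

# Slice: fix a single dimension value (Time = "Q1")
slice_q1 = {k: v for k, v in cube.items() if k[1] == "Q1"}

# Dice: pick member subsets on two or more dimensions
dice = {k: v for k, v in cube.items() if k[0] in {"Pune"} and k[1] in {"Q1", "Q2"}}

# Roll-up: climb the Location hierarchy (City -> Country)
rollup = {}
for (city, quarter), v in cube.items():
    key = (city_to_country[city], quarter)
    rollup[key] = rollup.get(key, 0) + v

# Drill-down: descend the Time hierarchy (Quarter -> Month)
drill = {(city, month): v
         for (city, q), months in q1_months.items()
         for month, v in months.items()}

# Pivot: rotate the view (swap the city and quarter axes)
pivot = {(quarter, city): v for (city, quarter), v in cube.items()}

print(rollup[("India", "Q1")])   # 680
```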
Types of OLAP Systems in DBMS

 Multidimensional OLAP (MOLAP) – Cube-based
 Relational OLAP (ROLAP) – Star-schema based
 Hybrid OLAP (HOLAP) – A combination of ROLAP and MOLAP
 MOLAP is an abbreviation for
Multi-dimensional Online
Analytical Processing.
 In this type of analytical
processing, multidimensional
databases (MDDBs) are used
to store data.
 This data is later used for
analysis. MOLAP consists of
data that is pre-computed
and fabricated.
 The data cubes from MDDBs
carry data that has already
been calculated. This
increases the speed of
querying data.
MOLAP
 Advantages
• It performs well with operations such as slice and dice.
• Users can use it to perform complex calculations.
• It consists of pre-computed data that can be indexed
fast.
 Disadvantages
• It can only store a limited volume of data.
• The data used for analysis depends on certain
requirements that were set (previously). This limits
data analysis and navigation.
ROLAP

 ROLAP is an abbreviation for Relational Online
Analytical Processing.
 In this type of analytical processing, data
storage is done in a relational database.
 In this database, the arrangement of data is
made in rows and columns. Data is presented
to end-users in a multi-dimensional form.
 There are three main components in a
ROLAP model:
 Database server: This exists in the data
layer. This consists of data that is loaded
into the ROLAP server.
 ROLAP server: This consists of the ROLAP
engine that exists in the application layer.
 Front-end tool: This is the client desktop
that exists in the presentation layer.
 Let’s briefly look at how ROLAP works. When
a user makes a query (complex), the ROLAP
server will fetch data from the RDBMS
server. The ROLAP engine will then create
data cubes dynamically. The user will view
data from a multi-dimensional point.
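The ROLAP behavior described above can be sketched with sqlite3 standing in for the RDBMS server; the schema and figures are hypothetical:

```python
import sqlite3

# ROLAP sketch: data lives in rows and columns in an RDBMS; the "cube"
# is created dynamically with SQL at query time (schema illustrative)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (product TEXT, city TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("Laptop", "Pune", 120.0), ("Laptop", "Mumbai", 200.0),
    ("Phone",  "Pune",  90.0), ("Phone",  "Mumbai", 160.0),
])

# The ROLAP engine answers a multidimensional question with GROUP BY,
# fetching from the relational store on demand rather than from a
# pre-computed cube
result = con.execute(
    "SELECT product, SUM(amount) FROM sales GROUP BY product ORDER BY product"
).fetchall()
print(result)   # [('Laptop', 320.0), ('Phone', 250.0)]
```

Because each query aggregates at run time, ROLAP scales to huge tables but pays in query latency, exactly the trade-off listed in the advantages and disadvantages.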
ROLAP
 Advantages
 It can handle huge volumes of data.
 A ROLAP model can store data efficiently.
 ROLAP utilizes a relational database. This enables the model to
integrate the ROLAP server with an RDBMS (relational database
management system).
 Disadvantages
 There is slow performance, especially when the volume of data is
huge.
 ROLAP has certain limitations relating to SQL. For example, the SQL
feature has difficulties in handling complex calculations.
HOLAP
 This is an abbreviation for Hybrid Online Analytical
Processing. This type of analytical processing solves the
limitations of MOLAP and ROLAP and combines their
attributes.
 Data in the database is divided into two parts:
specialized storage and relational storage.
 Integrating these two aspects addresses issues relating
to performance and scalability.
 HOLAP stores huge volumes of data in a relational
database and keeps aggregations in a MOLAP server.
• The HOLAP model consists of
a server that can support
ROLAP and MOLAP.
• It consists of a complex
architecture that requires
frequent maintenance.
• Queries made in the HOLAP
model involve the multi-
dimensional database and
the relational database.
• The front-user tool presents
data from the database
management system (directly)
or through the intermediate
MOLAP.
HOLAP
 Advantages
 It improves performance and scalability because it combines multi-
dimensional and relational attributes of online analytical processing.
 It is a resourceful analytical processing tool if we expect the size of data to
increase.
 Its processing ability is higher than the other two analytical processing
tools.
 Disadvantages
 The model uses a huge storage space because it consists of data from two
databases.
 The model requires frequent updates because of its complex nature.
MOLAP vs ROLAP vs HOLAP

 Meaning
• MOLAP: Multi-Dimensional Online Analytical Processing
• ROLAP: Relational Online Analytical Processing
• HOLAP: Hybrid Online Analytical Processing
 Data storage
• MOLAP: Stores data in a multi-dimensional database
• ROLAP: Stores data in a relational database
• HOLAP: Stores data in a relational database
 Technique
• MOLAP: Utilizes the sparse matrix technique
• ROLAP: Employs Structured Query Language (SQL)
• HOLAP: Uses a combination of SQL and the sparse matrix technique
 Volume of data
• MOLAP: Can process a limited volume of data
• ROLAP: Processes enormous volumes of data
• HOLAP: Can process huge volumes of data
 Designed view
• MOLAP: The multi-dimensional view is static
• ROLAP: The multi-dimensional view is dynamic
• HOLAP: The multi-dimensional view is dynamic
 Data arrangement
• MOLAP: Arranges data in data cubes
• ROLAP: Arranges data in rows and columns (tables)
• HOLAP: Multi-dimensional arrangement of data