
Unit-II Data Warehousing


SCSA3001

Subject Name: Data Mining & Data Warehousing


Faculty Name: K.Babu

UNIT-II
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

1 UNIT-III 7/8/2024
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

Data Warehouse
➢ A data warehouse is a collection of data marts representing
historical data from different operations in the company.
➢ It collects data from multiple heterogeneous sources (databases,
flat files, text files, etc.).
➢ It stores a large volume of data, typically covering 5 to 10 years.
This data is stored in a structure optimized for querying and data
analysis.


Subject Oriented: Data that gives information about a particular
subject instead of about a company’s ongoing operations.
Integrated: Data that is gathered into the data warehouse from a
variety of sources and merged into a coherent whole.
Time-variant: All data in the data warehouse is identified with a
particular time period.
Non-volatile: Data is stable in a data warehouse. More data is
added but data is never removed.


Enterprise Data Warehouse:
It collects all information about subjects (customers, products,
sales, assets, personnel) that span the entire organization.
Decision Support System (DSS): Information technology that helps
the knowledge worker (executive, manager, and analyst) make
faster and better decisions.


Operational and Informational Data
Operational Data:
➢ Focuses on transactional functions such as bank card
withdrawals and deposits
➢ Detailed
➢ Updateable
➢ Reflects current data


Informational Data:
➢Focusing on providing answers to problems posed by decision
makers
➢Summarized
➢Non updateable
Data Warehouse Characteristics
➢It is a database designed for analytical tasks
➢Its content is periodically updated
➢It contains current and historical data to provide historical
perspective of information.


DATA WAREHOUSE COMPONENTS

➢The data warehouse architecture is based on the database
management system server.
➢The central information repository is surrounded by a number of
key components.
➢A data warehouse is an environment, not a product. It is based
on a relational database management system.
➢The data entered into the data warehouse is transformed into an
integrated structure and format.


➢The transformation process involves conversion, summarization,
and filtering.
➢The data warehouse must be capable of holding and managing
large volumes of data as well as different data structures over
time.


Key components
➢Data sourcing, cleanup, transformation, and migration tools
➢Metadata repository
➢Warehouse/database technology
➢Data marts
➢Data query, reporting, analysis, and mining tools
➢Data warehouse administration and management
➢ Information delivery system


Data sourcing, cleanup, transformation, and migration tools


➢They perform conversions, summarization, key changes, and
structural changes.
➢Data transformation is required so that the data can be used by
decision support tools.
➢The tools produce programs and control statements.
➢They move the data into the data warehouse from multiple
operational systems.


The functionalities of these tools are listed below:
➢Removing unwanted data from operational databases
➢Converting to common data names and attributes
➢Calculating summaries and derived data
➢Establishing defaults for missing data
➢Accommodating source data definition changes
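
The functionalities listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the source rows, the field names (`cust_nm`, `amt`, `status`) and the default value `UNKNOWN` are hypothetical and do not come from any particular tool.

```python
# Hypothetical operational rows; field names are assumptions for illustration.
source_rows = [
    {"cust_nm": "Asha", "amt": "120.50", "region": "South", "status": "ok"},
    {"cust_nm": "Ravi", "amt": "80.00", "region": None, "status": "ok"},
    {"cust_nm": "Test", "amt": "0.00", "region": "North", "status": "test"},
]

# Map source-specific names to common warehouse names.
COMMON_NAMES = {"cust_nm": "customer_name", "amt": "amount", "region": "region"}
# Defaults used when a source value is missing.
DEFAULTS = {"region": "UNKNOWN"}


def transform(rows):
    """Filter, rename, fill defaults and convert types."""
    cleaned = []
    for row in rows:
        if row["status"] == "test":  # remove unwanted data
            continue
        out = {}
        for src, target in COMMON_NAMES.items():
            value = row.get(src)
            if value is None:  # establish a default for missing data
                value = DEFAULTS.get(target)
            out[target] = value
        out["amount"] = float(out["amount"])  # convert text to a number
        cleaned.append(out)
    return cleaned


def summarize(rows):
    """Calculate a summary (total amount per region)."""
    totals = {}
    for row in rows:
        totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]
    return totals


warehouse_rows = transform(source_rows)
region_totals = summarize(warehouse_rows)
```

Here `warehouse_rows` holds the cleaned detail data and `region_totals` holds the derived summary that would be loaded alongside it.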


Metadata repository
It is data about data. It is used for maintaining, managing and using
the data warehouse.
Technical metadata: It contains information about data
warehouse data for use by warehouse designers and administrators
when carrying out development and management tasks. It includes:
➢Information about data stores.
➢Transformation descriptions, that is, the mapping methods from
the operational database to the warehouse database.
➢Warehouse object and data structure definitions for target data.

➢The rules used to perform cleanup and data enhancement.
➢Data mapping operations.
➢Access authorization, backup history, archive history, information
delivery history, data acquisition history, data access, etc.


Business metadata: It contains information that describes the data
stored in the data warehouse to users. It includes:
➢Subject areas and information object types, including queries,
reports, images, video, audio clips, etc.
➢ Internet home pages
➢ Information related to the information delivery system
➢Data warehouse operational information such as ownerships,
audit trails, etc.


Metadata helps users to understand the content and find the data.
Metadata is stored in a separate data store known as the
information directory or metadata repository, which helps to
integrate, maintain and view the contents of the data warehouse.

Data warehouse database
This is the central part of the data warehousing environment.
It is implemented based on RDBMS technology.


Data marts
A data mart is an inexpensive alternative to the data warehouse.
It is based on a subject area. A data mart is used in the
following situations:
➢Extremely urgent user requirement
➢The absence of a budget for a full scale data warehouse strategy
➢The decentralization of business needs


Data query, reporting, analysis, and mining tools
Their purpose is to provide information to business users for
decision making. There are five categories:
➢Data query and reporting tools
➢Application development tools
➢Executive info system tools (EIS)
➢OLAP tools
➢Data mining tools


Query and reporting tools: Used to generate queries and reports.
➢Production reporting tools are used to generate regular operational
reports.
➢Desktop report writers are inexpensive desktop tools designed for
end users.
Managed query tools: Used to generate SQL queries. They use metalayer
software between users and databases, which offers point-and-click
creation of SQL statements.
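
As a rough illustration of what such a metalayer does, the sketch below maps business terms chosen by a user to physical columns and assembles an SQL statement. The table names (`sales`, `cust`), column names and the `build_query` helper are all hypothetical and are not taken from any real product.

```python
# Hypothetical metalayer: business terms mapped to physical columns.
METALAYER = {
    "Customer": "cust.cust_name",
    "City": "cust.city",
    "Sales Amount": "SUM(sales.amount)",
}
# Assumed join between a sales table and a customer table.
FROM_CLAUSE = "sales JOIN cust ON sales.cust_id = cust.cust_id"


def build_query(selected_terms):
    """Turn the terms a user clicked on into one SQL statement."""
    columns = [METALAYER[term] for term in selected_terms]
    # Non-aggregated columns must be grouped when an aggregate is present.
    group_by = [c for c in columns if not c.startswith("SUM(")]
    sql = f"SELECT {', '.join(columns)} FROM {FROM_CLAUSE}"
    if group_by and len(group_by) < len(columns):
        sql += f" GROUP BY {', '.join(group_by)}"
    return sql


query = build_query(["City", "Sales Amount"])
```

The user only sees the business terms; the generated text in `query` is what would be sent to the database.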


Application development tools: A graphical data access
environment which integrates OLAP tools with the data warehouse and
can be used to access all database systems.
OLAP tools: Used to analyze the data in multidimensional and
complex views.
Data mining tools: Used to discover knowledge from the data
warehouse data.


Data warehouse administration and management


➢Security and priority management
➢Monitoring updates from multiple sources
➢Data quality checks
➢Managing and updating meta data
➢Auditing and reporting data warehouse usage and status
➢Backup and recovery
➢Data warehouse storage management which includes capacity
planning, hierarchical storage management and purging of aged
data etc.,


Information delivery system


➢It enables the process of subscribing to data warehouse
information.
➢It delivers the information to one or more destinations according
to a specified scheduling algorithm.

BUILDING A DATA WAREHOUSE

Business factors:
➢Business users want to make decisions quickly and correctly
using all available data.
Technological factors:
➢To address the incompatibility of operational data stores
➢IT infrastructure is changing rapidly. Its capacity is increasing
and its cost is decreasing, so building a data warehouse has become easier.


Business factors
Top-Down Approach: Enterprise-wide business requirements are
collected first, and an enterprise data warehouse is built with
data marts as its subsets.
Bottom-Up Approach: Data marts are developed and integrated as and
when the requirements become clear, and are then combined to form
a data warehouse.
The advantage of the bottom-up approach is that it does not
require high initial costs and has a faster implementation time.

Design considerations:
➢In general, a data warehouse integrates data from multiple
heterogeneous sources into a query database. This is also one of the
reasons why a data warehouse is difficult to build.
Data content
➢The content and structure of the data warehouse are reflected in
its data model.
➢The data model is the template that describes how information
will be organized within the integrated warehouse framework.
➢The data warehouse data must be detailed data. It must be
formatted, cleaned up and transformed to fit the warehouse data
model.

Meta data
➢It defines the location and contents of data in the warehouse.
➢Meta data is searchable by users to find definitions or subject
areas.
Data distribution
➢Data volumes continue to grow. Therefore, it becomes
necessary to know how the data should be divided across multiple
servers.
➢The data can be distributed based on the subject area, location
(geographical region), or time (current, month, year)


Technological factors
Technical considerations:
Hardware platforms
➢An important consideration when choosing a data warehouse
server is its capacity for handling high volumes of data.
➢It must offer large data capacity and high throughput.
➢Modern servers can also support large data volumes and a large
number of flexible GUIs.

Data warehouse and DBMS specialization
➢Databases are very large, and complex ad hoc queries must be
processed in a short time.
➢The most important requirements for the data warehouse DBMS
are performance, throughput and scalability.
Communication infrastructure
➢The data warehouse user requires relatively large bandwidth
to interact with the data warehouse and retrieve a significant
amount of data for analysis.
➢This may mean that communication networks have to be
expanded and new hardware and software may have to be purchased.

Access tools:
Data warehouse implementation relies on selecting suitable data
access tools. The following lists the various types of data that can be
accessed:
1. Simple tabular form data
2. Ranking data
3. Multivariable data
4. Time series data
5. Graphing, charting and pivoting data


Data extraction, cleanup, transformation and migration:
➢Timeliness of data delivery to the warehouse
➢The tool must be able to identify the particular data so that it
can be read by the conversion tool.
➢The tool must support flat files and indexed files, since much
corporate data is still stored in these formats.
➢The tool must have the capability to merge data from multiple
data stores.
➢The tool should have a specification interface to indicate the data
to be extracted.


➢The tool should have the ability to read data from the data
dictionary.
➢The code generated by the tool should be completely
maintainable.
➢The data warehouse database system must be able to load
data directly from these tools.


Data replication
➢Data replication, or data movement, places the data needed by a
particular workgroup in a localized database.
➢Most companies use data replication servers to copy their most
needed data to a separate database.

Metadata
➢Metadata is a road map to the information stored in the
warehouse. It defines all data elements and their attributes.


Data placement strategies


➢As a data warehouse grows, there are at least two options for
data placement. One is to move some of the data in the data
warehouse to other storage media.
➢The second option is to distribute the data in the data warehouse
across multiple servers.


User levels
Casual users: These users are most comfortable retrieving
information from the warehouse in predefined formats and running
pre-existing queries and reports.
Power users: These users can use predefined as well as user-defined
queries to create simple and ad hoc reports. They can engage in
drill-down operations and may have experience of using reporting
and query tools.
Expert users: These users tend to create their own complex
queries and perform standard analysis on the info they retrieve.


Benefits of data warehousing


Tangible benefits:
➢Improvement in product inventory
➢Reduction in production cost
➢Improvement in selection of target markets
Intangible benefits (not easy to quantify):
➢Improvement in productivity by keeping all data in a single
location and eliminating rekeying of data.
➢Reduced redundant processing
➢Enhanced customer relations

Multi Dimensional Data Model

The Multi Dimensional Data Model is a method used for
ordering data in the database, with good arrangement and
assembling of the contents of the database.
The Multi Dimensional Data Model allows users to pose
analytical questions about market or business trends, unlike
relational databases, which allow users to access data in the
form of queries.


It represents data in the form of data cubes. Data cubes allow
data to be modeled and viewed from many dimensions and perspectives.
A data cube is defined by dimensions and facts and is represented by
a fact table. Facts are numerical measures, and the fact table
contains the names of the facts (measures) together with keys to
each of the related dimension tables.
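
A minimal sketch of this structure, assuming three hypothetical dimensions (item, time, location) and one measure (sales amount). The keys and values are made up for illustration only.

```python
# Dimension tables: key -> descriptive attribute (hypothetical values).
item_dim = {1: "Keyboard", 2: "Mouse"}
time_dim = {1: "Q1", 2: "Q2"}
location_dim = {1: "Delhi", 2: "Kolkata"}

# Fact table: (item_key, time_key, location_key) -> measure (sales amount).
sales_fact = {
    (1, 1, 1): 100,
    (1, 2, 1): 150,
    (2, 1, 1): 60,
    (2, 1, 2): 40,
}


def cell(item, quarter, city):
    """Look up one cube cell by dimension values; empty cells return 0."""
    keys = tuple(
        next(k for k, v in dim.items() if v == name)
        for dim, name in ((item_dim, item), (time_dim, quarter), (location_dim, city))
    )
    return sales_fact.get(keys, 0)


keyboard_q2_delhi = cell("Keyboard", "Q2", "Delhi")
```

Each cell of the cube is addressed by one value per dimension, and the fact table stores only the keys and the measure.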


Working on a Multidimensional Data Model

Stage 1 : Assembling data from the client : In the first stage,
the correct data is collected from the client. Mostly, software
professionals explain to the client the range of data that can be
obtained with the selected technology and then collect the
complete data in detail.
Stage 2 : Grouping different segments of the system : In the
second stage, all the data is recognized and classified into the
respective sections it belongs to, which makes it easier to apply
step by step.

Stage 3 : Identifying the different dimensions : The third stage
is the basis on which the design of the system rests. In this
stage, the main factors are recognized from the user’s
point of view. These factors are also known as “Dimensions”.
Stage 4 : Preparing the factors and their respective
qualities : In the fourth stage, the factors recognized in the
previous step are used to identify the related qualities. These
qualities are also known as “attributes” in the database.


Stage 5 : Finding the facts for the factors listed previously
and their qualities : In the fifth stage, the facts are separated
and differentiated from the factors collected so far. These facts
play a significant role in the arrangement of a Multi Dimensional
Data Model.
Stage 6 : Building the schema to place the data, with respect to
the information collected in the steps above : In the sixth
stage, a schema is built on the basis of the data collected
previously.


For Example :
1. Let us take the example of a firm. The revenue of a firm can
be analyzed on the basis of different factors such as the geographical
location of the firm’s workplace, the products of the firm, the
advertisements done, the time taken to develop a product, etc.


Let us take the example of the data of a factory which sells
products per quarter in Bangalore. The data is represented in the
table given below :


In the presentation given above, the factory’s sales for Bangalore
are shown along the time dimension, which is organized into quarters,
and the item dimension, which is sorted according to the kind of
item sold. The facts here are represented in rupees (in
thousands).
Now, if we want to view the sales data in three dimensions, it can
still be represented as a series of two-dimensional tables. Let us
consider the data according to item, time and location (like
Kolkata, Delhi, Mumbai). Here is the table :


This data can be represented in the form of three dimensions
conceptually, which is shown in the image below :


Advantages of Multi Dimensional Data Model


✓ A multi-dimensional data model is easy to handle.
✓ It is easy to maintain.
✓ Its performance is better than that of normal databases
✓ The representation of data is better than traditional databases.
That is because the multi-dimensional databases are multi-viewed
and carry different types of factors.


Disadvantages of Multi Dimensional Data Model


➢ The multi-dimensional Data Model is slightly complicated in
nature and it requires professionals to recognize and examine the
data in the database.
➢ During the work of a Multi-Dimensional Data Model, when the
system caches, there is a great effect on the working of the
system.
➢ It is complicated in nature due to which the databases are
generally dynamic in design.


OLAP OPERATIONS IN THE MULTIDIMENSIONAL DATA MODEL

OLAP stands for Online Analytical Processing. It is a
software technology that allows users to analyze information
from multiple database systems at the same time. It is based on
the multidimensional data model and allows the user to query
multi-dimensional data (e.g. Delhi -> 2018 -> Sales data). OLAP
databases are divided into one or more cubes, and these cubes
are known as hyper-cubes.


OLAP operations:
Drill down: In drill-down operation, the less detailed data is
converted into highly detailed data. It can be done by:
✓ Moving down in the concept hierarchy
✓ Adding a new dimension
In the cube given in the overview section, the drill-down operation
is performed by moving down in the concept hierarchy
of the Time dimension (Quarter -> Month).
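
Drill-down along the Time hierarchy can be sketched as follows. The monthly figures and the `MONTH_TO_QUARTER` mapping are hypothetical values chosen for illustration, not data from the cube in the slides.

```python
# Concept hierarchy for Time: each month belongs to one quarter.
MONTH_TO_QUARTER = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1", "Apr": "Q2"}
# Hypothetical detailed data at month level.
monthly_sales = {"Jan": 30, "Feb": 50, "Mar": 20, "Apr": 70}

# Less detailed view: sales summarized by quarter.
quarterly_sales = {}
for month, amount in monthly_sales.items():
    quarter = MONTH_TO_QUARTER[month]
    quarterly_sales[quarter] = quarterly_sales.get(quarter, 0) + amount


def drill_down(quarter):
    """Return the month-level detail behind one quarter-level total."""
    return {m: a for m, a in monthly_sales.items() if MONTH_TO_QUARTER[m] == quarter}


q1_detail = drill_down("Q1")
```

The quarter total is replaced by the more detailed monthly values that make it up.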


Roll up: It is the opposite of the drill-down operation. It performs
aggregation on the OLAP cube. It can be done by:
✓Climbing up in the concept hierarchy
✓Reducing the dimensions
In the cube given in the overview section, the roll-up operation is
performed by climbing up in the concept hierarchy
of the Location dimension (City -> Country).
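
A small sketch of roll-up by climbing the Location hierarchy. The city-to-country mapping and the sales figures are assumptions made for this example.

```python
# Concept hierarchy for Location: each city belongs to one country.
CITY_TO_COUNTRY = {"Delhi": "India", "Kolkata": "India", "Toronto": "Canada"}
# Hypothetical cube cells keyed by (city, quarter).
city_sales = {
    ("Delhi", "Q1"): 100,
    ("Kolkata", "Q1"): 40,
    ("Toronto", "Q1"): 70,
    ("Delhi", "Q2"): 150,
}


def roll_up(sales):
    """Aggregate city-level cells into country-level cells."""
    rolled = {}
    for (city, quarter), amount in sales.items():
        key = (CITY_TO_COUNTRY[city], quarter)
        rolled[key] = rolled.get(key, 0) + amount
    return rolled


country_sales = roll_up(city_sales)
```

The number of cells shrinks because several cities are aggregated into one country.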


Dice:
It selects a sub-cube from the OLAP cube by selecting two or more
dimensions. In the cube given in the overview section, a sub-cube
is selected by selecting following dimensions with criteria:
Location = “Delhi” or “Kolkata”
Time = “Q1” or “Q2”
Item = “Car” or “Bus”
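
The dice criteria above can be expressed as a filter over cube cells. The cube contents below are hypothetical; only the selection criteria follow the text.

```python
# Hypothetical cube cells keyed by (location, quarter, item).
cube = {
    ("Delhi", "Q1", "Car"): 10,
    ("Delhi", "Q3", "Car"): 7,
    ("Mumbai", "Q1", "Bus"): 4,
    ("Kolkata", "Q2", "Bus"): 6,
    ("Kolkata", "Q2", "Truck"): 9,
}


def dice(cells, locations, quarters, items):
    """Keep only cells whose value on every dimension is in the allowed set."""
    return {
        key: value
        for key, value in cells.items()
        if key[0] in locations and key[1] in quarters and key[2] in items
    }


sub_cube = dice(cube, {"Delhi", "Kolkata"}, {"Q1", "Q2"}, {"Car", "Bus"})
```

The result keeps all three dimensions but restricts each of them to the selected values.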


Slice: It selects a single value of one dimension of the OLAP cube,
which results in a new sub-cube. In the cube given in the
overview section, Slice is performed on the dimension Time =
“Q1”.
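
A minimal sketch of the slice Time = "Q1", using hypothetical cube cells. Fixing one value of the Time dimension leaves a sub-cube over the remaining two dimensions.

```python
# Hypothetical cube cells keyed by (location, quarter, item).
cube = {
    ("Delhi", "Q1", "Car"): 10,
    ("Delhi", "Q2", "Car"): 12,
    ("Kolkata", "Q1", "Bus"): 6,
}


def slice_time(cells, quarter):
    """Fix the Time dimension to one quarter and drop it from the keys."""
    return {
        (location, item): value
        for (location, q, item), value in cells.items()
        if q == quarter
    }


q1_slice = slice_time(cube, "Q1")
```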


Pivot: It is also known as the rotation operation, as it rotates the
current view to get a new view of the representation. In the
sub-cube obtained after the slice operation, performing the pivot
operation gives a new view of it.
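
Pivot can be pictured as swapping the row and column axes of a two-dimensional view. The table below is a hypothetical slice with locations as rows and items as columns.

```python
# Hypothetical 2-D view: rows are locations, columns are items.
q1_view = {
    "Delhi": {"Car": 10, "Bus": 3},
    "Kolkata": {"Car": 4, "Bus": 6},
}


def pivot(table):
    """Rotate the view so that former columns become rows."""
    rotated = {}
    for row_label, row in table.items():
        for col_label, value in row.items():
            rotated.setdefault(col_label, {})[row_label] = value
    return rotated


by_item = pivot(q1_view)
```

The values are unchanged; only the orientation of the view differs, so pivoting twice returns the original table.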

Three-Tier Data Warehouse Architecture

Data warehouses usually have a three-level (tier) architecture that
includes:
✓Bottom Tier (Data Warehouse Server)
✓Middle Tier (OLAP Server)
✓Top Tier (Front end Tools).
The bottom tier consists of the data warehouse server, which
is almost always an RDBMS. It may include several specialized data
marts and a metadata repository.


Data from operational databases and external sources (such as
user profile data provided by external consultants) are extracted
using application program interfaces known as gateways. A gateway
is provided by the underlying DBMS and allows client
programs to generate SQL code to be executed at a server.
Examples of gateways include ODBC (Open Database
Connectivity) and OLE DB (Object Linking and Embedding,
Database), by Microsoft, and JDBC (Java Database Connectivity).

The middle tier consists of an OLAP server for fast querying
of the data warehouse. The OLAP server is typically implemented
using either:
(1) A Relational OLAP (ROLAP) model, i.e., an extended relational
DBMS that maps operations on multidimensional data to standard
relational operations.
(2) A Multidimensional OLAP (MOLAP) model, i.e., a special-purpose
server that directly implements multidimensional data and
operations.
The top tier contains front-end tools for displaying results
provided by OLAP, as well as additional tools for data mining of
the OLAP-generated data.

The metadata repository stores information that defines DW
objects. It includes the following parameters and information for
the middle and the top-tier applications:
1. A description of the DW structure, including the warehouse
schema, dimension, hierarchies, data mart locations, and
contents, etc.
2. Operational metadata, which usually describes the currency
level of the stored data, i.e., active, archived or purged, and
warehouse monitoring information, i.e., usage statistics, error
reports, audit, etc.

3. System performance data, which includes indices used to
improve data access and retrieval performance.
4. Information about the mapping from operational databases,
which covers source RDBMSs and their contents, cleaning
and transformation rules, etc.
5. Summarization algorithms, predefined queries, and reports.
6. Business data, which includes business terms and definitions,
ownership information, etc.


Load Performance
Data warehouses require incremental loading of new data on a periodic
basis within narrow time windows. Performance of the load
process should be measured in hundreds of millions of rows and
gigabytes per hour and must not artificially constrain the volume of
data the business requires.
Load Processing
Many steps must be taken to load new or updated data into the
data warehouse, including data conversion, filtering, reformatting,
indexing, and metadata update.


Data Quality Management


Fact-based management demands the highest data quality. The
warehouse ensures local consistency, global consistency, and
referential integrity despite "dirty" sources and massive database
size.
Query Performance
Fact-based management must not be slowed by the performance of the
data warehouse RDBMS; large, complex queries must complete in
seconds, not days.

SCHEMAS FOR MULTI-DIMENSIONAL DATA MODEL

A schema is a logical description of the entire database. It includes
the name and description of records of all record types, including
all associated data items and aggregates. Much like a database, a
data warehouse also needs to maintain a schema.
A database uses the relational model, while a data warehouse uses
the Star, Snowflake, and Fact Constellation schemas. In this chapter,
we will discuss the schemas used in a data warehouse.


Star Schema
➢ Each dimension in a star schema is represented with only one
dimension table.
➢ This dimension table contains the set of attributes.
➢ The following diagram shows the sales data of a company with
respect to the four dimensions, namely time, item, branch, and
location.


➢There is a fact table at the center. It contains the keys to each of
the four dimensions.
➢The fact table also contains the measures, namely dollars sold
and units sold.
Note − Each dimension has only one dimension table, which holds all
of its attributes. For example, the location dimension table contains the
attribute set {location_key, street, city, province_or_state, country}.
This constraint may cause data redundancy.
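
The star schema described above can be sketched with Python's built-in `sqlite3` module. The table names, column types and sample rows are assumptions for illustration; only the dimension names and the location attribute set follow the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
CREATE TABLE branch_dim (branch_key INTEGER PRIMARY KEY, branch_name TEXT);
CREATE TABLE location_dim (
    location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
    province_or_state TEXT, country TEXT);
-- Fact table: one key per dimension plus the two measures.
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim,
    item_key INTEGER REFERENCES item_dim,
    branch_key INTEGER REFERENCES branch_dim,
    location_key INTEGER REFERENCES location_dim,
    dollars_sold REAL, units_sold INTEGER);
INSERT INTO time_dim VALUES (1, 'Q1', 2024);
INSERT INTO item_dim VALUES (1, 'Keyboard', 'Acme');
INSERT INTO branch_dim VALUES (1, 'Main');
-- Province and country are repeated in every row: the redundancy noted above.
INSERT INTO location_dim VALUES
    (1, '1 Main St', 'Vancouver', 'British Columbia', 'Canada'),
    (2, '9 Bay Rd', 'Victoria', 'British Columbia', 'Canada');
INSERT INTO sales_fact VALUES (1, 1, 1, 1, 500.0, 5), (1, 1, 1, 2, 300.0, 3);
""")

# One join from the fact table to a dimension table answers the query.
province_totals = conn.execute("""
    SELECT l.province_or_state, SUM(f.dollars_sold)
    FROM sales_fact f JOIN location_dim l ON f.location_key = l.location_key
    GROUP BY l.province_or_state
""").fetchall()
```

Because every dimension is a single table, a query needs at most one join per dimension, at the price of repeated attribute values.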


Snowflake Schema
➢Some dimension tables in the Snowflake schema are
normalized.
➢The normalization splits up the data into additional tables.
➢Unlike the Star schema, the dimension tables in a snowflake schema
are normalized. For example, the item dimension table in the star
schema is normalized and split into two dimension tables, namely
the item and supplier tables.


Now the item dimension table contains the attributes item_key,
item_name, type, brand, and supplier_key.
The supplier key is linked to the supplier dimension table. The
supplier dimension table contains the attributes supplier_key and
supplier_type.
Note − Due to normalization in the Snowflake schema, the
redundancy is reduced. Therefore, it becomes easy to maintain
and saves storage space.
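
A minimal sketch of the normalized item and supplier tables, again using Python's built-in `sqlite3` module. The attribute names follow the text; the column types and sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE supplier_dim (supplier_key INTEGER PRIMARY KEY, supplier_type TEXT);
CREATE TABLE item_dim (
    item_key INTEGER PRIMARY KEY, item_name TEXT, type TEXT, brand TEXT,
    supplier_key INTEGER REFERENCES supplier_dim);
-- The supplier type is stored once, not once per item.
INSERT INTO supplier_dim VALUES (1, 'wholesale');
INSERT INTO item_dim VALUES
    (1, 'Keyboard', 'input', 'Acme', 1),
    (2, 'Mouse', 'input', 'Acme', 1);
""")

# Reading the supplier type now needs an extra join.
item_suppliers = conn.execute("""
    SELECT i.item_name, s.supplier_type
    FROM item_dim i JOIN supplier_dim s ON i.supplier_key = s.supplier_key
    ORDER BY i.item_key
""").fetchall()
```

The trade-off is visible here: less repeated data, but one more join whenever supplier attributes are needed.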

Fact Constellation Schema
➢ A fact constellation has multiple fact tables. It is also known as
galaxy schema.
➢ The following diagram shows two fact tables, namely sales and
shipping.
➢ The sales fact table is the same as that in the star schema.
➢ The shipping fact table has five dimensions, namely
item_key, time_key, shipper_key, from_location, to_location.
➢ The shipping fact table also contains two measures, namely
dollars sold and units sold.


It is also possible to share dimension tables between fact tables.
For example, the time, item, and location dimension tables are shared
between the sales and shipping fact tables.

OLAP
Online Analytical Processing (OLAP) is a category of software
that allows users to analyze information from multiple database
systems at the same time. It is a technology that enables analysts to
extract and view business data from different points of view.
How does it work?
A data warehouse extracts information from multiple data
sources and formats such as text files, Excel sheets, multimedia files, etc.
The extracted data is cleaned and transformed. The data is then loaded
into an OLAP server, where information is pre-calculated for
further analysis.
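
One way to picture this pre-calculation is to aggregate the measure for every subset of dimensions ahead of time, so that later queries become simple lookups. The sketch below does this for three hypothetical dimensions; it illustrates the idea and does not describe how any specific OLAP server works.

```python
from itertools import combinations

DIMENSIONS = ("item", "quarter", "city")
# Hypothetical cleaned fact rows.
facts = [
    {"item": "Car", "quarter": "Q1", "city": "Delhi", "sales": 10},
    {"item": "Car", "quarter": "Q2", "city": "Delhi", "sales": 20},
    {"item": "Bus", "quarter": "Q1", "city": "Kolkata", "sales": 5},
]


def precompute(rows):
    """Aggregate sales for every subset of the dimensions."""
    aggregates = {}
    for size in range(len(DIMENSIONS) + 1):
        for dims in combinations(DIMENSIONS, size):
            totals = {}
            for row in rows:
                key = tuple(row[d] for d in dims)
                totals[key] = totals.get(key, 0) + row["sales"]
            aggregates[dims] = totals
    return aggregates


aggregates = precompute(facts)
car_total = aggregates[("item",)][("Car",)]
```

With 3 dimensions there are 2^3 = 8 groupings, from the grand total (no dimensions) to the full detail (all three).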

Basic analytical operations of OLAP

1) Roll-up:
Roll-up is also known as “consolidation” or “aggregation.” The
roll-up operation can be performed in 2 ways:
➢Reducing dimensions
➢Climbing up the concept hierarchy. A concept hierarchy is a system
of grouping things based on their order or level.


➢In this example, the cities New Jersey and Los Angeles are rolled up
into the country USA.
➢The sales figures of New Jersey and Los Angeles are 440 and
1560 respectively. They become 2000 after roll-up.
➢In this aggregation process, the data in the location hierarchy moves
up from city to country.
➢In the roll-up process, one or more dimensions need to
be removed. In this example, the Cities dimension is removed.


2) Drill-down
In drill-down, data is fragmented into smaller parts. It is the
opposite of the roll-up process. It can be done via
➢Moving down the concept hierarchy
➢Increasing a dimension
Consider the diagram below.
Quarter Q1 is drilled down to the months January, February, and March.
The corresponding sales are also registered.
In this example, the dimension months is added.


3) Slice:
Here, one dimension is selected, and a new sub-cube is created.
The following diagram explains how the slice operation is performed:
➢ Dimension Time is sliced with Q1 as the filter.
➢ A new cube is created altogether.


Dice:
This operation is similar to a slice. The difference in dice is you
select 2 or more dimensions that result in the creation of a sub-
cube.


4) Pivot
In Pivot, you rotate the data axes to provide an alternative
presentation of data.
In the following example, the pivot is based on item types.


Types of OLAP


Type of OLAP | Explanation
Relational OLAP (ROLAP) | ROLAP is an extended RDBMS along with multidimensional data mapping to perform the standard relational operations.
Multidimensional OLAP (MOLAP) | MOLAP implements operations on multidimensional data.
Hybrid Online Analytical Processing (HOLAP) | In the HOLAP approach, the aggregated totals are stored in a multidimensional database while the detailed data is stored in the relational database. This offers both the data efficiency of the ROLAP model and the performance of the MOLAP model.


Desktop OLAP (DOLAP) | In Desktop OLAP, a user downloads a part of the data from the database locally, or onto their desktop, and analyzes it.
Web OLAP (WOLAP) | Web OLAP is an OLAP system accessible via the web browser. WOLAP is a three-tiered architecture. It consists of three components: client, middleware, and a database server.
Mobile OLAP | Mobile OLAP helps users to access and analyze OLAP data using their mobile devices.
Spatial OLAP (SOLAP) | SOLAP is created to facilitate management of both spatial and non-spatial data in a Geographic Information System (GIS).


Advantages of OLAP
➢ OLAP is a platform for all types of business needs, including
planning, budgeting, reporting, and analysis.
➢ Information and calculations are consistent in an OLAP cube.
This is a crucial benefit.
➢ Quickly create and analyze “What if” scenarios
➢ Easily search OLAP database for broad or specific terms.
➢ OLAP provides the building blocks for business modeling tools,
Data mining tools, performance reporting tools.


Disadvantages of OLAP
➢OLAP requires organizing data into a star or snowflake schema.
These schemas are complicated to implement and administer
➢You cannot have large number of dimensions in a single OLAP
cube
➢Transactional data cannot be accessed with OLAP system.
➢Any modification in an OLAP cube needs a full update of the
cube. This is a time-consuming process

Comparisons of OLAP vs OLTP

OLAP (Online analytical processing) | OLTP (Online transaction processing)
Consists of historical data from various databases. | Consists only of operational current data.
It is subject oriented. Used for data mining, analytics, decision making, etc. | It is application oriented. Used for business tasks.
The data is used in planning, problem solving and decision making. | The data is used to perform day-to-day fundamental operations.
Large amount of data is stored, typically in TB, PB. | The size of the data is relatively small as the historical data is archived, e.g. MB, GB.
Relatively slow as the amount of data involved is large. Queries may take hours. | Very fast as the queries operate on 5% of the data.
It only needs backup from time to time as compared to OLTP. | Backup and recovery process is maintained religiously.
This data is generally managed by CEO, MD, GM. | This data is managed by clerks, managers.
Only read and rarely write operations. | Both read and write operations.


Architecture for On-Line Analytical Mining: An OLAM server
performs analytical mining in data cubes in a similar manner as
an OLAP server performs on-line analytical processing.
The OLAM and OLAP servers both accept user on-line
queries (or commands) via a graphical user interface API and
work with the data cube in the data analysis via a cube API.


A metadata directory is used to guide the access of the data cube.


The data cube can be constructed by accessing and/or integrating
multiple databases via an MDDB API and/or by filtering a data
warehouse via a database API that may support OLE DB or ODBC
connections.
Since an OLAM server may perform multiple data mining tasks,
such as concept description, association, classification, prediction,
clustering, time-series analysis, and so on, it usually consists of
multiple integrated data mining modules and is more
sophisticated than an OLAP server.