
Module-1


Welcome

To

DATA WAREHOUSING AND DATA MINING - 21CS732

29-11-2024
Department of Information Science and Engg
1
Transform Here
Modules and High Level Topics

Module – 1: Data warehousing and OLAP


Module – 2: Data warehouse implementation & Data Mining
Module – 3: Association Analysis Methods
Module – 4: Classification Methods
Module – 5: Clustering Analysis Methods

Detailed Syllabus – Module Wise
Module-1: Data warehousing and OLAP : Basic Concepts: Data
Warehousing: A multitier Architecture, Data warehouse models: Enterprise
warehouse, Data mart and virtual warehouse, Extraction, Transformation and
loading, Data Cube: A multidimensional data model, Stars, Snowflakes and
Fact constellations: Schemas for multidimensional Data models, Dimensions:
The role of concept Hierarchies, Measures: Their Categorization and
computation, Typical OLAP Operations

Module-2: Data warehouse implementation & Data mining: Efficient Data
Cube computation: An overview, Indexing OLAP Data: Bitmap index and join
index, Efficient processing of OLAP Queries, OLAP server Architecture:
ROLAP versus MOLAP versus HOLAP. Introduction: What is data mining,
Challenges, Data Mining Tasks, Data: Types of Data, Data Quality, Data
Preprocessing, Measures of Similarity and Dissimilarity

Module-3: Association Analysis: Association Analysis: Problem Definition,
Frequent Item set Generation, Rule generation. Alternative Methods for
Generating Frequent Item sets, FPGrowth Algorithm, Evaluation of
Association Patterns.

Module-4: Classification: Decision Trees Induction, Method for Comparing
Classifiers, Rule Based Classifiers, Nearest Neighbor Classifiers, Bayesian
Classifiers.

Module-5: Clustering Analysis: Overview, K-Means, Agglomerative
Hierarchical Clustering, DBSCAN, Cluster Evaluation, Density-Based
Clustering, Graph-Based Clustering, Scalable Clustering Algorithms.

Course Outcomes

CO1: Apply DWH architecture and multidimensional modelling for
DWH solutions
CO2: Design DWH for real world problem statements
CO3: Design association rules and classification statements for a
given data pattern
CO4: Evaluate the classification and clustering techniques for
real world problem statements
Text Books
1. Pang-Ning Tan, Michael Steinbach, Vipin Kumar: Introduction to Data Mining,
Pearson, First Impression, 2014.
2. Jiawei Han, Micheline Kamber, Jian Pei: Data Mining - Concepts and Techniques, 3rd
Edition, Morgan Kaufmann Publishers, 2012.

Reference Books:

1. Sam Anahory, Dennis Murray: Data Warehousing in the Real World, Pearson, Tenth
Impression, 2012.
2. Michael J. Berry, Gordon S. Linoff: Mastering Data Mining, Wiley, Second
Edition, 2012.
We will deep dive into DWH & DM
Module-1: Data Warehousing & Modelling
Basic Concepts: Data Warehousing: A multitier Architecture.
Data warehouse models: Enterprise warehouse, Data mart and virtual warehouse.
Extraction, Transformation and loading.
Data Cube: A multidimensional data model.
Stars, Snowflakes and Fact constellations: Schemas for multidimensional Data models.
Dimensions: The role of concept Hierarchies.
Measures: Their Categorization and computation.
Typical OLAP Operations.
Basic Definitions
Data: Raw facts that can be recorded or acquired and that have an implicit
meaning. Ex: age, color, name, etc.

Database: A collection of related data, organized for effective and
efficient storage and retrieval.

Database Management System (DBMS): A software package/system that
facilitates the creation and maintenance of a computerized database.

Mini-world (DB - Problem Statement): Some part of the real world about
which data is stored in a database. For example, student grades and
transcripts at a university.
What is a Data Warehouse?
■ Defined in many different ways, but not rigorously.
■ A decision support database that is maintained separately from
the organization’s operational database
■ Supports information processing by providing a solid platform of
consolidated, historical data for analysis.
■ “A data warehouse is a subject-oriented, integrated, time-variant, and
nonvolatile collection of data in support of management’s decision-
making process.”— W. H. Inmon

■ Data warehousing:
■ The process of constructing and using data warehouses
Data Warehouse - Subject-Oriented
■ Organized around major subjects, such as customer,
product, sales
■ Focusing on the modeling and analysis of data for decision
makers, not on daily operations or transaction processing.
■ Provide a simple and concise view around particular subject
issues by excluding data that are not useful in the decision
support process

Data Warehouse - Integrated
■ Constructed by integrating multiple, heterogeneous data
sources
■ Relational databases, flat files, on-line transaction
records
■ Data cleaning and data integration techniques are
applied.
■ Ensure consistency in naming conventions, encoding
structures, attribute measures, etc. among different data
sources
■ E.g., Hotel price: currency, tax, breakfast covered, etc.
■ When data is moved to the warehouse, it is
converted.
Data Warehouse - Nonvolatile

■ A physically separate store of data transformed from the
operational environment
■ Operational update of data does not occur in the data
warehouse environment
■ Does not require transaction processing, recovery, and
concurrency control mechanisms
■ Requires only two operations in data accessing:
■ initial loading of data and access of data
Data Warehouse - Time Variant
■ The time horizon for the data warehouse is significantly longer
than that of operational systems
■ Operational database: current value data
■ Data warehouse data: provide information from a
historical perspective (e.g., past 5-10 years)
■ Every key structure in the data warehouse
■ Contains an element of time, explicitly or implicitly
(e.g., dwh_create_time (dwh_cttm), dwh_update_time (dwh_up_time))
■ But the key of operational data may or may not
contain a “time element”
The major distinguishing features of OLTP and OLAP are
summarized as follows:
Users and system orientation:

• An OLTP system is customer-oriented and is used for
transaction and query processing by clerks, clients, and
information technology professionals.

• An OLAP system is market-oriented and is used for data
analysis by knowledge workers, including managers, executives,
and analysts.

OLTP – Online Transaction Processing
OLAP – Online Analytical Processing
■ Data contents:

An OLTP system manages current data that, typically, are
too detailed to be easily used for decision making.

An OLAP system manages large amounts of historic data,
provides facilities for summarization and aggregation, and
stores and manages information at different levels of
granularity.

These features make the data easier to use for informed
decision making.
Database Design:

An OLTP system usually adopts an entity-relationship (ER) data
model and an application-oriented database design.

An OLAP system typically adopts either a star or a
snowflake model and a subject-oriented database design.
■ View: An OLTP system focuses mainly on the current data
within an enterprise or department, without referring to
historic data or data in different organizations.

■ In contrast, an OLAP system often spans multiple versions of
a database schema, due to the evolutionary process of an
organization.

■ OLAP systems also deal with information that originates from
different organizations, integrating information from many
data stores.
■ Because of their huge volume, OLAP data are stored on
multiple storage media.
■ Access patterns: The access patterns of an OLTP system consist
mainly of short, atomic transactions. Such a system requires
concurrency control and recovery mechanisms.

■ However, accesses to OLAP systems are mostly read-only
operations (because most data warehouses store historic rather
than up-to-date information), although many could be complex
queries.
OLTP vs. OLAP
Parameter            OLTP                               OLAP
users                clerk, IT professional             knowledge worker
function             day to day operations              decision support
DB design            application-oriented               subject-oriented
data                 current, up-to-date, detailed,     historical, summarized,
                     flat relational, isolated          multidimensional, integrated,
                                                        consolidated
usage                repetitive                         ad-hoc
access               read/write,                        lots of scans
                     index/hash on prim. key
unit of work         short, simple transaction          complex query
# records accessed   tens                               millions
# users              thousands                          hundreds
DB size              100MB-GB                           100GB-TB
metric               transaction throughput             query throughput, response
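The contrast in the "access" and "unit of work" rows can be made concrete with a small sketch. It uses Python's built-in sqlite3 module; the orders table, its columns, and the values are hypothetical and serve only as an illustration, not as a description of any real system.

```python
import sqlite3

# Hypothetical orders table; names and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (region, year, amount) VALUES (?, ?, ?)",
    [("East", 2023, 100.0), ("East", 2024, 150.0),
     ("West", 2023, 80.0), ("West", 2024, 120.0)],
)

# OLTP-style unit of work: a short read/write transaction that touches
# one record through its primary key.
with conn:
    conn.execute("UPDATE orders SET amount = amount + 10 WHERE order_id = ?", (1,))

# OLAP-style unit of work: a read-only query that scans every row and
# summarizes it for decision support.
olap_result = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
print(olap_result)
```

The update locates a single row by key, while the aggregate has to read the whole table, which is why the two workloads are tuned so differently.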
How are organizations using the information from
data warehouses?
Many organizations use this information to support business decision-making
activities, including
1. Increasing customer focus, which includes the analysis of customer buying
patterns (such as buying preference, buying time, budget cycles, and
appetites for spending);
2. Repositioning products and managing product portfolios by comparing
the performance of sales by quarter, by year, and by geographic regions
in order to fine-tune production strategies;
3. Analyzing operations and looking for sources of profit; and
4. Managing customer relationships, making environmental corrections, and
managing the cost of corporate assets

■ Because operational databases store huge amounts of
data, you may wonder, “Why not perform online
analytical processing directly on such databases instead
of spending additional time and resources to
construct a separate data warehouse?”

Why a Separate Data Warehouse?
■ High performance for both systems
■ DBMS - tuned for OLTP: access methods, indexing, concurrency control,
recovery
■ Warehouse - tuned for OLAP: complex OLAP queries, multidimensional
view, consolidation
■ Different functions and different data:
■ missing data: Decision support (DS) requires historical data which
operational DBs do not typically maintain
■ data consolidation: DS requires consolidation (aggregation,
summarization) of data from heterogeneous sources
■ data quality: different sources typically use inconsistent data
representations, codes and formats which have to be reconciled
■ Note: There are more and more systems which perform OLAP analysis directly
on relational databases
Data Warehouse: A Multi-Tiered Architecture

[Figure: multi-tiered data warehouse architecture, in four stages.
Data Sources: operational DBs and other sources, fed in through Extract, Transform, Load, Refresh (with Monitor & Integrator).
Data Storage: data warehouse, data marts, metadata.
OLAP Engine: OLAP server.
Front-End Tools: analysis, query, reports, data mining.]
■ The bottom tier is a warehouse database server that is
almost always a relational database system. Back-end tools
and utilities are used to feed data into the bottom tier from
operational databases or other external sources (e.g.,
customer profile information provided by external
consultants)

■ These tools and utilities perform data extraction, cleaning,
and transformation (e.g., to merge similar data from different
sources into a unified format), as well as load and refresh
functions to update the data warehouse
The middle tier is an OLAP server that is typically
implemented using either

a) A relational OLAP (ROLAP) model
(i.e., an extended relational DBMS that maps operations on
multidimensional data to standard relational operations); or

b) A multidimensional OLAP (MOLAP) model
(i.e., a special-purpose server that directly implements
multidimensional data and operations)
The top tier is a front-end client layer, which contains
query and reporting tools, analysis tools, and/or
data mining tools (e.g., trend analysis, prediction,
and so on).

Three Data Warehouse Models
■ Enterprise warehouse
■ collects all of the information about subjects spanning the
entire organization
■ Data Mart
■ a subset of corporate-wide data that is of value to a specific
group of users. Its scope is confined to specific, selected
groups, such as a marketing data mart
■ Independent vs. dependent (directly from warehouse) data mart
■ Virtual warehouse
■ A set of views over operational databases
■ Only some of the possible summary views may be
materialized
■ A virtual warehouse is easy to build but requires
excess capacity on operational database servers
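
A minimal sketch of the virtual warehouse idea, assuming an in-memory SQLite database stands in for an operational server. The sales_txn table and the branch_sales view are hypothetical names chosen for the illustration.

```python
import sqlite3

# Stand-in for an operational database; table and column names are hypothetical.
op_db = sqlite3.connect(":memory:")
op_db.execute(
    "CREATE TABLE sales_txn (txn_id INTEGER PRIMARY KEY, branch TEXT, amount REAL)"
)
op_db.executemany(
    "INSERT INTO sales_txn (branch, amount) VALUES (?, ?)",
    [("B1", 20.0), ("B1", 30.0), ("B2", 50.0)],
)

# The "virtual warehouse" is only a view: no data is copied, so every
# query against it runs on the operational server itself.
op_db.execute(
    "CREATE VIEW branch_sales AS "
    "SELECT branch, SUM(amount) AS total FROM sales_txn GROUP BY branch"
)

view_rows = op_db.execute(
    "SELECT branch, total FROM branch_sales ORDER BY branch"
).fetchall()
print(view_rows)
```

Defining the view is cheap, but each analytical query recomputes the aggregate on the operational tables, which is where the excess capacity requirement comes from.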

“What are the pros and cons of the top-down and bottom-up
approaches to data warehouse development?”
■ The top-down development of an enterprise warehouse
serves as a systematic solution and minimizes integration
problems.
■ However, it is expensive, takes a long time to develop,
and lacks flexibility due to the difficulty in achieving
consistency and consensus for a common data model for the
entire organization.

■ The bottom-up approach to the design,
development, and deployment of independent
data marts provides flexibility, low cost, and rapid
return on investment.

■ It can, however, lead to problems when
integrating various disparate data marts into a
consistent enterprise data warehouse.
■ Depending on the source of data, data marts can be
categorized as independent or dependent.
■ Independent data marts are sourced from data
captured from one or more operational systems
or external information providers, or from data
generated locally within a particular department
or geographic area.
■ Dependent data marts are sourced directly from
enterprise data warehouses.

Extraction, Transformation, and Loading (ETL)
■ Data warehouse systems use back-end tools and utilities to populate and
refresh their data. These tools and utilities include the following functions:
■ Data extraction
■ get data from multiple, heterogeneous, and external
sources
■ Data cleaning
■ detect errors in the data and rectify them when possible
■ Data transformation
■ convert data from legacy or host format to warehouse
format
■ Load
■ sort, summarize, consolidate, compute views, check integrity, and
build indices and partitions
■ Refresh
■ propagate the updates from the data sources to the warehouse
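The functions above can be sketched as a tiny pipeline. This is only an illustration under stated assumptions: the two source lists, their field names, and the cleaning rule (drop rows that cannot be repaired) are hypothetical, and a real ETL tool does far more.

```python
# Hypothetical source records with inconsistent formats (illustrative data).
source_a = [{"cust": "Alice", "amount": "100.50", "currency": "USD"}]
source_b = [
    {"customer": "bob ", "amt": 80, "currency": "usd"},
    {"customer": "", "amt": None, "currency": "USD"},  # dirty row
]

def extract():
    """Extraction: gather rows from multiple heterogeneous sources."""
    rows = [{"customer": r["cust"], "amount": r["amount"], "currency": r["currency"]}
            for r in source_a]
    rows += [{"customer": r["customer"], "amount": r["amt"], "currency": r["currency"]}
             for r in source_b]
    return rows

def clean(rows):
    """Cleaning: drop rows whose errors cannot be rectified."""
    return [r for r in rows if r["customer"].strip() and r["amount"] is not None]

def transform(rows):
    """Transformation: convert every row to one warehouse format."""
    return [{"customer": r["customer"].strip().title(),
             "amount": float(r["amount"]),
             "currency": r["currency"].upper()} for r in rows]

def load(rows, warehouse):
    """Load: sort the rows and append them to the warehouse (a list here)."""
    warehouse.extend(sorted(rows, key=lambda r: r["customer"]))
    return warehouse

warehouse = load(transform(clean(extract())), [])
print(warehouse)
```

A refresh step would rerun the same pipeline on only the rows that changed at the sources since the last load.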
Metadata Repository
Meta data is the data defining warehouse objects. It stores:
■ Description of the structure of the data warehouse
■ schema, view, dimensions, hierarchies, derived data definitions,
data mart locations and contents.
■ Operational meta-data
■ data lineage (history of migrated data and transformation path),
currency of data (active, archived, or purged),
monitoring information (warehouse usage statistics, error reports,
audit trails)
■ The algorithms used for summarization
■ which include measure and dimension definition algorithms, data on
granularity, partitions, subject areas, aggregation, summarization, and
predefined queries and reports.
■ The mapping from operational environment to the data warehouse
■ which includes source databases and their contents, gateway
descriptions, data partitions, data extraction, cleaning,
transformation rules and defaults, data refresh and purging rules,
and security (user authorization and access control).
■ Data related to system performance
■ which include indices and profiles that improve data access and
retrieval performance, in addition to rules for the timing and
scheduling of refresh, update, and replication cycles.
■ Business data
■ which include business terms and definitions, data ownership
information, and charging policies.
Data Warehousing and On-line Analytical Processing

■ Data Warehouse: Basic Concepts


■ Data Warehouse Modeling: Data Cube and OLAP
■ Data Warehouse Design and Usage
■ Data Warehouse Implementation
■ Data Generalization by Attribute-Oriented Induction

■ Summary

From Tables and Spreadsheets to Data Cubes

■ “What is a data cube?”
“A data cube allows data to be modeled and viewed in
multiple dimensions”.

■ It is defined by dimensions and facts.

Facts are numerical measures.

A dimension is a structure that categorizes data in order to enable
users to answer business questions
Dimensions are the perspectives or entities with respect to which an
organization wants to keep records.
■ E.g., AllElectronics may create a sales data warehouse in order to
keep records of the store’s sales with respect to the dimensions
time, item, branch, and location. These dimensions allow the store
to keep track of things like monthly sales of items and the
branches and locations at which the items were sold.

Each dimension may have a table associated with it, called a
dimension table, which further describes the dimension.
• For example, a dimension table for item may contain the attributes
item name, brand, and type.
• Dimension tables can be specified by users or experts, or
automatically generated and adjusted based on data
distributions.
■ Facts are numeric measures. Think of them as the quantities
by which we want to analyze relationships between
dimensions.

■ Examples of facts for a sales data warehouse include
dollars sold (sales amount in dollars), units sold
(number of units sold), and amount budgeted.

■ The fact table contains the names of the facts, or measures,
as well as keys to each of the related dimension tables.
■ In the 2-D representation, the sales for Vancouver are
shown with respect to the time dimension
(organized in quarters) and the item
dimension (organized according to the types of
items sold).

■ The fact or measure displayed is dollars sold (in
thousands).
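
A 2-D view of this kind can be sketched in a few lines of Python. The records below are made-up values for one city, not the figures from the table the slide refers to, and the variable names are hypothetical.

```python
from collections import defaultdict

# Illustrative records: (quarter, item type, dollars sold in thousands).
# The numbers are invented for the example.
vancouver_sales = [
    ("Q1", "home entertainment", 100), ("Q1", "computer", 200),
    ("Q2", "home entertainment", 150), ("Q2", "computer", 250),
    ("Q1", "computer", 10),
]

# Build the 2-D view: rows are quarters (time), columns are item types (item).
view = defaultdict(lambda: defaultdict(int))
for quarter, item_type, dollars in vancouver_sales:
    view[quarter][item_type] += dollars

for quarter in sorted(view):
    print(quarter, dict(view[quarter]))
```

Each cell of the view holds the measure (dollars sold) aggregated over all records that share the same quarter and item type.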

■ Suppose that we would like to view the sales data
with a third dimension.

■ For instance, suppose we would like to view the
data according to time and item, as well as
location, for the cities Chicago, New York, Toronto,
and Vancouver. These 3-D data are shown in
Table 4.3.
■ A 3-D data cube representation (a cuboid) of the data in Table 4.3, according to
time, item, and location.
■ The measure displayed is dollars sold (in thousands).
Multidimensional Data
Sales volume as a function of product, month, and
region.

■ Suppose that we would now like to view our sales
data with an additional fourth dimension such as
supplier.

A 4-D data cube representation of sales data, according to time, item, location,
and supplier. The measure displayed is dollars sold (in thousands). For improved
readability, only some of the cube values are shown.

Cube: A Lattice of Cuboids

➢ Given a set of dimensions, we can generate a cuboid for each of
the possible subsets of the given dimensions.
➢ The result would form a lattice of cuboids, each showing the data
at a different level of summarization, or group-by.

➢ The lattice of cuboids is then referred to as a data cube.

➢ The accompanying figure shows a lattice of cuboids forming a data cube
for the dimensions time, item, location, and supplier.
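
As a rough sketch, the lattice for the four dimensions named above can be enumerated with itertools.combinations. The grouping by number of dimensions is a presentation choice for this example, and concept hierarchies within a dimension are ignored.

```python
from itertools import combinations

dimensions = ["time", "item", "location", "supplier"]

# One cuboid (group-by) per subset of the dimensions, grouped by how many
# dimensions each keeps: 0-D up to 4-D.
lattice = {
    k: list(combinations(dimensions, k))
    for k in range(len(dimensions) + 1)
}

for k, cuboids in lattice.items():
    print(f"{k}-D:", cuboids)

# With n dimensions and no concept hierarchies there are 2**n cuboids.
total_cuboids = sum(len(cuboids) for cuboids in lattice.values())
print(total_cuboids)
```

The empty subset is the most summarized cuboid and the full set of four dimensions is the least summarized one.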

■ The cuboid that holds the lowest level of summarization is called the
base cuboid.

■ For example, the 4-D cuboid in Figure 4.4 is the base cuboid for the
given time, item, location, and supplier dimensions.

■ The 0-D cuboid, which holds the highest level of summarization, is
called the apex cuboid.

■ In our example, this is the total sales, or dollars sold, summarized
over all four dimensions. The apex cuboid is typically denoted by all.
Cube: A Lattice of Cuboids

Lattice of cuboids, making up a 4-D data cube for time, item, location, and supplier. Each cuboid
represents a different degree of summarization.
Stars, Snowflakes, and Fact Constellations:
Schemas for Multidimensional Data Models

Conceptual Modeling of Data Warehouses
■ Modeling data warehouses: dimensions & measures
■ Star schema: A fact table in the middle connected to a set
of dimension tables
■ Snowflake schema: A refinement of star schema where
some dimensional hierarchy is normalized into a set of
smaller dimension tables, forming a shape similar to
snowflake
■ Fact constellations: Multiple fact tables share
dimension tables, viewed as a collection of stars,
therefore called galaxy schema or fact constellation
Star Schema:
■ The most common modeling paradigm is the star
schema, in which the data warehouse contains

1) A large central table (fact table) containing the
bulk of the data, with no redundancy, and
2) A set of smaller attendant tables (dimension
tables), one for each dimension.

■ The schema graph resembles a starburst, with the
dimension tables displayed in a radial pattern around
the central fact table.
■ Example 4.1 Star schema. A star schema for
AllElectronics sales is shown in Figure 4.6. Sales
are considered along four dimensions: time, item,
branch, and location. The schema contains a
central fact table for sales that contains keys to
each of the four dimensions, along with two
measures: dollars sold and units sold.
■ To minimize the size of the fact table, dimension
identifiers (e.g., time key and item key) are
system-generated identifiers.

Figure 4.6 Star schema of sales data warehouse.

Example of Star Schema
Sales Fact Table: time_key, item_key, branch_key, location_key, units_sold, dollars_sold, avg_sales
Measures: units_sold, dollars_sold, avg_sales
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, state_or_province, country
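
A reduced version of this star schema can be tried out with Python's sqlite3 module. This is a sketch under assumptions: only three of the four dimensions and a few attributes are kept, the `_dim` and `_fact` suffixes are naming choices of this example, and all rows are invented.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE item_dim     (item_key INTEGER PRIMARY KEY, item_name TEXT, type TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);
-- Fact table: keys to the dimension tables plus the measures.
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim(time_key),
    item_key     INTEGER REFERENCES item_dim(item_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    units_sold   INTEGER,
    dollars_sold REAL
);
INSERT INTO time_dim VALUES (1, 'Q1', 2024), (2, 'Q2', 2024);
INSERT INTO item_dim VALUES (1, 'Laptop X', 'computer'), (2, 'TV Y', 'home entertainment');
INSERT INTO location_dim VALUES (1, 'Vancouver', 'Canada'), (2, 'Chicago', 'USA');
INSERT INTO sales_fact VALUES
    (1, 1, 1, 2, 2000.0), (1, 2, 1, 1, 500.0),
    (2, 1, 2, 3, 3000.0), (2, 1, 1, 1, 1000.0);
""")

# A typical star join: one join per dimension used, then aggregate a measure.
star_rows = db.execute("""
    SELECT l.city, t.quarter, SUM(s.dollars_sold)
    FROM sales_fact AS s
    JOIN time_dim AS t ON s.time_key = t.time_key
    JOIN location_dim AS l ON s.location_key = l.location_key
    GROUP BY l.city, t.quarter
    ORDER BY l.city, t.quarter
""").fetchall()
print(star_rows)
```

Every dimension is reached from the fact table in a single join, which is what keeps star queries simple.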
Snowflake Schema:
■ The snowflake schema is a variant of the star
schema model, where some dimension tables are
normalized, thereby further splitting the data into
additional tables.

■ The resulting schema graph forms a shape similar
to a snowflake.
■ The major difference between the snowflake and star schema models is
that the dimension tables of the snowflake model may be kept in
normalized form to reduce redundancies.

■ Such a table is easy to maintain and saves storage space. However,
this space savings is negligible in comparison to the typical magnitude
of the fact table.

■ Furthermore, the snowflake structure can reduce the effectiveness of
browsing, since more joins will be needed to execute a query.
Consequently, the system performance may be adversely impacted.
Hence, although the snowflake schema reduces redundancy, it is not as
popular as the star schema in data warehouse design.
Example of Snowflake Schema
Sales Fact Table: time_key, item_key, branch_key, location_key, units_sold, dollars_sold, avg_sales
Measures: units_sold, dollars_sold, avg_sales
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_key
supplier: supplier_key, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city_key
city: city_key, city, state_or_province, country
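
The cost of normalizing a dimension can be seen in a small sqlite3 sketch. It models only the snowflaked location dimension (location and city) and a cut-down fact table; the rows are invented for the illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
-- Snowflaked location dimension: city attributes moved to their own table.
CREATE TABLE city (city_key INTEGER PRIMARY KEY, city TEXT,
                   state_or_province TEXT, country TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, street TEXT,
                       city_key INTEGER REFERENCES city(city_key));
CREATE TABLE sales (location_key INTEGER REFERENCES location(location_key),
                    dollars_sold REAL);
INSERT INTO city VALUES (1, 'Vancouver', 'British Columbia', 'Canada');
INSERT INTO location VALUES (10, '1 Main St', 1), (11, '2 Oak Ave', 1);
INSERT INTO sales VALUES (10, 400.0), (11, 600.0);
""")

# Reaching the country now takes two joins (sales -> location -> city);
# with a star-schema location table the same query needs only one.
snow_rows = db.execute("""
    SELECT c.country, SUM(s.dollars_sold)
    FROM sales AS s
    JOIN location AS l ON s.location_key = l.location_key
    JOIN city AS c ON l.city_key = c.city_key
    GROUP BY c.country
""").fetchall()
print(snow_rows)
```

The city row is stored once instead of once per street address, which is the redundancy the snowflake form removes at the price of the extra join.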
Fact Constellation
Sophisticated applications may require multiple fact tables to share dimension
tables. This kind of schema can be viewed as a collection of stars, and hence is
called a galaxy schema or a fact constellation.

Example (see figure): This schema specifies two fact tables, sales and shipping.
The sales table definition is identical to that of the star schema.
The shipping table has five dimensions, or keys: item key, time key, shipper key,
from location, and to location, and two measures: dollars cost and units shipped.
A fact constellation schema allows dimension tables to be shared between fact
tables.
For example, the dimension tables for time, item, and location are shared
between the sales and shipping fact tables.
Example of Fact Constellation
Sales Fact Table: time_key, item_key, branch_key, location_key, units_sold, dollars_sold, avg_sales
Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location, dollars_cost, units_shipped
Measures: units_sold, dollars_sold, avg_sales (sales); dollars_cost, units_shipped (shipping)
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, province_or_state, country
shipper: shipper_key, shipper_name, location_key, shipper_type
Dimensions: The Role of Concept Hierarchies

■ A concept hierarchy defines a sequence of mappings from
a set of low-level concepts to higher-level, more general
concepts.
■ Consider a concept hierarchy for the dimension location. City
values for location include Vancouver, Toronto, New York,
and Chicago.
■ Many concept hierarchies are implicit within the database
schema. For example, suppose that the dimension location is
described by the attributes number, street, city, province or
state, zip code, and country. These attributes are related by a
total order, forming a concept hierarchy such as “street < city
< province or state < country.”
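A concept hierarchy for location can be sketched as a chain of child-to-parent mappings. The dictionaries, the generalize helper, and the sales numbers are hypothetical and exist only for this illustration.

```python
# Child -> parent mappings for "city < province or state < country".
city_to_state = {"Vancouver": "British Columbia", "Toronto": "Ontario",
                 "New York": "New York", "Chicago": "Illinois"}
state_to_country = {"British Columbia": "Canada", "Ontario": "Canada",
                    "New York": "USA", "Illinois": "USA"}

def generalize(city, level):
    """Map a city value to the requested level of the hierarchy."""
    if level == "city":
        return city
    state = city_to_state[city]
    if level == "province_or_state":
        return state
    if level == "country":
        return state_to_country[state]
    raise ValueError(f"unknown level: {level}")

# Illustrative sales per city (made-up numbers).
sales_by_city = {"Vancouver": 10, "Toronto": 20, "New York": 30, "Chicago": 40}

# Summarize the city-level data at the country level.
sales_by_country = {}
for city, amount in sales_by_city.items():
    country = generalize(city, "country")
    sales_by_country[country] = sales_by_country.get(country, 0) + amount
print(sales_by_country)
```

Replacing each low-level value by its ancestor and re-aggregating is all that is needed to view the same data at a coarser level.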
■ Hierarchical and lattice structures of
attributes in warehouse dimensions:
■ (a) a hierarchy for location and
■ (b) a lattice for time.

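Lattice:
In everyday usage, a regular geometrical arrangement of
points or objects over an area or in space. For warehouse
dimensions it denotes a partial order of attributes: in the
time dimension, week and month are not comparable, so the
structure branches instead of forming a single chain.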
A Concept Hierarchy: Dimension (location)

all: all
region: Europe, ..., North_America
country: Germany, ..., Spain (Europe); Canada, ..., Mexico (North_America)
city: Frankfurt, ... (Germany); Vancouver, ..., Toronto (Canada)
office: L. Chan, ..., M. Wind
View of Warehouses and Hierarchies

Specification of hierarchies
■ Schema hierarchy
day < {month < quarter; week} < year
■ Set_grouping hierarchy
{1..10} < inexpensive

URL: https://www2.cs.sfu.ca/CourseCentral/459/han/tutorial/tutorial.html
Measures: Their Categorization and Computation

■ A data cube measure is a numeric function that can be
evaluated at each point in the data cube space.
■ A measure value is computed for a given point by
aggregating the data corresponding to the respective
dimension value pairs defining the given point
■ Measures can be organized into three categories,
based on the kind of aggregate functions used:
■ Distributive,
■ Algebraic, and
■ Holistic
Data Cube Measures: Three Categories
■ Distributive: if the result derived by applying the function to
n aggregate values, is the same as that derived by applying
the function on all the data without partitioning
■ E.g., count(), sum(), min(), max()
■ Algebraic: if it can be computed by an algebraic function with
M arguments (where M is a bounded integer), each of which is
obtained by applying a distributive aggregate function
■ E.g., avg(), min_N(), standard_deviation()
■ Holistic: if there is no constant bound on the storage size
needed to describe a subaggregate.
■ E.g., median(), mode(), rank()
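
The three categories can be checked on a small partitioned data set. The partitions below are arbitrary illustrative numbers; the sketch uses sum() for the distributive case, avg() for the algebraic case, and median() for the holistic case.

```python
from statistics import median

partitions = [[4, 8, 15], [16, 23], [42]]  # illustrative data
all_values = [v for part in partitions for v in part]

# Distributive: the sum of per-partition sums equals the sum over all data.
partial_sums = [sum(p) for p in partitions]
total = sum(partial_sums)

# Algebraic: avg is derived from two distributive results, sum and count.
partial_counts = [len(p) for p in partitions]
average = sum(partial_sums) / sum(partial_counts)

# Holistic: the median of per-partition medians is in general not the
# median of the data, so no fixed-size sub-aggregate is enough.
median_of_medians = median(median(p) for p in partitions)
true_median = median(all_values)
print(total, average, median_of_medians, true_median)
```

For this data the median of the partition medians differs from the true median, which is why holistic measures are the expensive ones to compute over a cube.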

Typical OLAP Operations
■ Roll up (drill-up): summarize data
■ by climbing up hierarchy or by dimension reduction
■ Drill down (roll down): reverse of roll-up
■ from higher level summary to lower level summary or
detailed data, or introducing new dimensions
■ Slice and dice: project and select
■ Pivot (rotate):
■ reorient the cube, visualization, 3D to series of 2D planes
■ Other operations
■ Drill Across: involving (across) more than one fact table
■ Drill Through: through the bottom level of the cube to its back-end
relational tables (using SQL)
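
The main operations can be sketched on a toy cube stored as a Python dictionary. The cube contents, the country_of mapping, and the function names are assumptions made for this example; a real OLAP server implements these operations very differently.

```python
from collections import defaultdict

# Tiny cube as {(quarter, item_type, city): dollars_sold}; made-up numbers.
cube = {
    ("Q1", "computer", "Vancouver"): 10, ("Q1", "computer", "Toronto"): 20,
    ("Q1", "phone", "Vancouver"): 30, ("Q2", "computer", "Chicago"): 40,
    ("Q2", "phone", "Toronto"): 50,
}
country_of = {"Vancouver": "Canada", "Toronto": "Canada", "Chicago": "USA"}

def roll_up_location(cells):
    """Roll-up: climb the location hierarchy from city to country."""
    out = defaultdict(int)
    for (quarter, item, city), value in cells.items():
        out[(quarter, item, country_of[city])] += value
    return dict(out)

def slice_time(cells, quarter):
    """Slice: select on one dimension (time), yielding a sub-cube."""
    return {k: v for k, v in cells.items() if k[0] == quarter}

def dice(cells, quarters, cities):
    """Dice: select on two or more dimensions."""
    return {k: v for k, v in cells.items() if k[0] in quarters and k[2] in cities}

def pivot(cells_2d):
    """Pivot: swap the two axes of a 2-D view."""
    return {(b, a): v for (a, b), v in cells_2d.items()}

rolled = roll_up_location(cube)
q1 = slice_time(cube, "Q1")
diced = dice(cube, {"Q1", "Q2"}, {"Toronto"})
# Drop the fixed time coordinate of the slice to get an (item, city) view.
q1_view = {(item, city): v for (_, item, city), v in q1.items()}
rotated = pivot(q1_view)
print(rolled, diced, rotated, sep="\n")
```

Drill-down is the reverse of the roll-up shown here: it needs the more detailed city-level cells, which is why the detailed data must be kept somewhere.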

ADDITIONAL INFORMATION

Design of Data Warehouse: A Business Analysis Framework
■ Four views regarding the design of a data warehouse
■ Top-down view
■ allows selection of the relevant information necessary for the data
warehouse
■ Data source view
■ exposes the information being captured, stored, and managed by
operational systems
■ Data warehouse view
■ consists of fact tables and dimension tables
■ Business query view
■ sees the perspectives of data in the warehouse from the view of end-
user
Data Warehouse Design Process
■ Top-down, bottom-up approaches or a combination of both
■ Top-down: Starts with overall design and planning (mature)
■ Bottom-up: Starts with experiments and prototypes (rapid)
■ From software engineering point of view
■ Waterfall: structured and systematic analysis at each step before
proceeding to the next
■ Spiral: rapid generation of increasingly functional systems, with
short turnaround time
■ Typical data warehouse design process
■ Choose a business process to model, e.g., orders, invoices, etc.
■ Choose the grain (atomic level of data) of the business process
■ Choose the dimensions that will apply to each fact table record
■ Choose the measure that will populate each fact table record
Data Warehouse Development: A Recommended Approach

[Figure: recommended development approach. Define a high-level corporate data model; build data marts and the enterprise data warehouse, with model refinement on each; then distributed data marts; finally a multi-tier data warehouse.]
Data Warehouse Usage
■ Three kinds of data warehouse applications
■ Information processing
■ supports querying, basic statistical analysis, and reporting
using crosstabs, tables, charts and graphs
■ Analytical processing
■ multidimensional analysis of data warehouse data
■ supports basic OLAP operations, slice-dice, drilling, pivoting
■ Data mining
■ knowledge discovery from hidden patterns
■ supports associations, constructing analytical models, performing
classification and prediction, and presenting the mining results using
visualization tools
From On-Line Analytical Processing (OLAP) to On Line
Analytical Mining (OLAM)
■ Why Online Analytical Mining?
■ High quality of data in data warehouses
■ DW contains integrated, consistent, cleaned data
■ Available information processing structure surrounding data
warehouses
■ ODBC, OLEDB, Web accessing, service facilities,
reporting and OLAP tools
■ OLAP-based exploratory data analysis
■ Mining with drilling, dicing, pivoting, etc.
■ On-line selection of data mining functions
■ Integration and swapping of multiple mining
functions, algorithms, and tasks
Reflections about Today's Session

Google Form – Quiz

https://docs.google.com/forms/d/e/1FAIpQLSfdiEz7A6Z3iNU26F6XAxLO0AU6P06AuOl7mGeGKzMOeUhiKw/viewform
Conclusion
We have studied the following concepts in today's class:
1. Topics of Module-1
2. Learning Objectives
3. Basic Definitions of database approaches
4. Database system environment
5. Main Characteristics of the Database Approach
6. Advantages of using the DBMS Approach
7. Historical Development of Database Technology
8. Database Languages and Architectures
9. Schemas versus Instances
10.Reflections

Contact Details:

Dr.Manjunath T N
Professor and Dean – ER
Department of Information Science and Engg
BMS Institute of Technology and Management
Mobile: +91-9900130748
E-Mail: [email protected] / [email protected]
