
Unit 2

Data Warehouse
What is a Data Warehouse?

A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of
data in support of management's decision-making process.
Subject-oriented
The data in the warehouse is organized around subjects or topics rather than the applications or source systems that
generate the data. For example, all sales data, regardless of where it comes from, is logically organized and grouped in the
data warehouse. Assembling data around subject areas allows for a unified view of the subject—in our case, sales—rather
than disparate views from each system.
Integrated
The data from each source system (e.g. CRM, ERP, Behavioral Data, or e-commerce platforms) is brought together and
made consistent in the data warehouse. For instance, if one system uses region names like “North Carolina” and another
uses abbreviations such as “NC,” an integrated data warehouse would reconcile the region names to produce a consistent
coding system.
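
As a small illustration of the integration step, the following Python sketch reconciles region names from two source systems into one coding scheme (the source records and the mapping table are invented for the example):

# Minimal integration sketch: map every spelling of a region onto one code.
# The source rows and the mapping table below are illustrative only.
STATE_TO_CODE = {"North Carolina": "NC", "New York": "NY", "California": "CA"}

crm_rows = [{"customer": "Acme", "region": "North Carolina"}]   # CRM spells regions out
erp_rows = [{"customer": "Acme", "region": "NC"}]               # ERP already abbreviates

def normalize_region(value):
    return STATE_TO_CODE.get(value, value)

integrated = [{**row, "region": normalize_region(row["region"])} for row in crm_rows + erp_rows]
print(integrated)   # both records now carry region "NC"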
Time-variant
Data in the warehouse is maintained over time, allowing for trend analysis, forecasting, AI/ML, and historical reporting.
It's not just a snapshot of the current moment but a timeline of data that allows businesses to see changes and
developments over time.
Non-volatile
Data written into the warehouse doesn't ever get overwritten or deleted, ensuring the stability and reliability of the data,
which is crucial for trustworthy analysis. Data can be added to the warehouse, but existing data typically remains
unchanged.
Architecture of Data Warehouse:
1. Data Sources:
Data warehouses collect data from multiple sources, including operational databases, CRM systems, ERP systems,
spreadsheets, flat files, and external sources such as social media and web services.

2. ETL (Extract, Transform, Load) Process:


The ETL process involves extracting data from source systems, transforming it into a consistent format suitable for analysis,
and loading it into the data warehouse. This process often includes data cleansing, data validation, data transformation, and
integration of disparate data sources.
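
A highly simplified ETL pass could look like the Python sketch below. This is only an illustration: the CSV file, column names, and SQLite table are hypothetical, and a real pipeline would add far more validation.

# Minimal ETL sketch: extract rows from a CSV file, transform them, load into SQLite.
import csv
import sqlite3

def extract(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Cleanse and standardize: drop rows with no amount, trim and upper-case the
    # region code, and cast the amount to a number.
    cleaned = []
    for row in rows:
        if not row.get("amount"):
            continue
        cleaned.append({"region": row["region"].strip().upper(),
                        "amount": float(row["amount"])})
    return cleaned

def load(rows, db_path="warehouse.db"):
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS sales_fact (region TEXT, amount REAL)")
    con.executemany("INSERT INTO sales_fact VALUES (:region, :amount)", rows)
    con.commit()
    con.close()

load(transform(extract("sales.csv")))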

3. Staging Area:
The staging area is an intermediate storage area where data extracted from source systems is temporarily stored before it
undergoes transformation and loading into the data warehouse. It helps in data validation, debugging, and maintaining data
integrity during the ETL process.

4. Data Warehouse Database:


The data warehouse database is the central repository where transformed and integrated data from various sources are stored. It
typically uses a relational database management system (RDBMS) optimized for analytical queries and reporting. The database
schema often follows a dimensional modelling approach, such as star schema or snowflake schema, consisting of fact tables and
dimension tables.
5. Data Marts:
Data marts are subsets of the data warehouse that are tailored for specific business units, departments, or user groups within an
organization. They contain pre-aggregated and summarized data relevant to the particular needs of users, enabling faster query
performance and easier access to relevant information.
Data Warehouse Architecture
A data warehouse architecture defines the overall framework for data communication, processing, and presentation that exists for
end-user computing within the enterprise. Each data warehouse is different, but all are characterized by standard vital components.
Production applications such as payroll, accounts payable, product purchasing, and inventory control are designed for online
transaction processing (OLTP). Such applications gather detailed data from day-to-day operations.
Data warehouse applications are designed to support users' ad-hoc data requirements, an activity recently dubbed online
analytical processing (OLAP). These include applications such as forecasting, profiling, summary reporting, and trend analysis.
Production databases are updated continuously, either by hand or via OLTP applications. In contrast, a warehouse database
is updated from operational systems periodically, usually during off-hours. As OLTP data accumulates in production
databases, it is regularly extracted, filtered, and then loaded into a dedicated warehouse server that is accessible to users. As
the warehouse is populated, it must be restructured: tables are de-normalized, data is cleansed of errors and redundancies, and
new fields and keys are added to reflect the needs of users for sorting, combining, and summarizing data.

Data warehouses and their architectures vary depending upon the specifics of an organization's situation.
Three common architectures are:
•Data Warehouse Architecture: Basic
•Data Warehouse Architecture: With Staging Area
•Data Warehouse Architecture: With Staging Area and Data Marts
Data Warehouse Architecture: Basic
Operational System
An operational system is a method used in data warehousing to refer to a system that is used to process the day-to-day
transactions of an organization.
Flat Files
A Flat file system is a system of files in which transactional data is stored, and every file in the system must have a different
name.
Meta Data
A set of data that defines and gives information about other data.
Metadata is used in a data warehouse for a variety of purposes, including:
Metadata summarizes necessary information about data, which can make finding and working with particular instances of data
easier. For example, author, date created, date modified, and file size are examples of very basic document metadata.
Metadata is also used to direct a query to the most appropriate data source.
Lightly and highly summarized data
This area of the data warehouse stores all the predefined lightly and highly summarized (aggregated) data generated by the
warehouse manager.
The goal of the summarized data is to speed up query performance. The summarized data is updated continuously as new
information is loaded into the warehouse.
End-User Access Tools
The principal purpose of a data warehouse is to provide information to business managers for strategic decision-making. These
users interact with the warehouse using end-client access tools.
The examples of some of the end-user access tools can be:
•Reporting and Query Tools
•Application Development Tools
•Executive Information Systems Tools
•Data Warehouse Architecture: With Staging Area
•We must clean and process operational data before putting it into the warehouse.
•We can do this programmatically, although most data warehouses use a staging area (a place where data is processed before entering the warehouse).
•A staging area simplifies data cleansing and consolidation of operational data coming from multiple source systems, especially for enterprise data warehouses, where all relevant data of an enterprise is consolidated.
A data warehouse staging area is a temporary location where records from source systems are copied.
•Data Warehouse Architecture: With Staging Area and Data Marts
•We may want to customize our warehouse's architecture for multiple groups within our organization.
•We can do this by adding data marts. A data mart is a segment of a data warehouse that provides information for reporting and analysis on a section, unit, department, or operation of the company, e.g., sales, payroll, production, etc.
•The figure illustrates an example where purchasing, sales, and stocks are separated. In this example, a financial analyst wants to analyze historical data for purchases and sales, or mine historical information to make predictions about customer behavior.
•What is a Multi-Dimensional Data Model?
•A multidimensional model views data in the form of a data cube. A data cube enables data to be modeled and viewed in multiple dimensions. It is defined by dimensions and facts.
•The dimensions are the perspectives or entities concerning which an organization keeps records. For example, a shop may create a sales data warehouse to keep records of the store's sales with respect to the dimensions time, item, and location. These dimensions allow the store to keep track of things such as monthly sales of items and the locations at which the items were sold. Each dimension has a table related to it, called a dimension table, which describes the dimension further. For example, a dimension table for item may contain the attributes item_name, brand, and type.
•A multidimensional data model is organized around a central theme, for example, sales. This theme is represented by a fact table. Facts are numerical measures. The fact table contains the names of the facts, or measures, along with keys to the related dimension tables.
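
To make the idea concrete, here is a minimal Python sketch of a fact table and its dimension tables (all keys, names, and figures are invented): each fact row references the dimension tables by key and carries the numeric measure.

# Illustrative dimension tables and a fact table for a small sales model.
item_dim = {
    "I1": {"item_name": "Keyboard", "brand": "BrandA", "type": "accessory"},
    "I2": {"item_name": "Monitor",  "brand": "BrandB", "type": "display"},
}
time_dim = {"Q1": {"quarter": "Q1", "year": 2024}}
location_dim = {"DEL": {"city": "Delhi"}}

# rupees_sold is the fact (numeric measure); the other columns are dimension keys.
sales_fact = [
    {"time": "Q1", "item": "I1", "location": "DEL", "rupees_sold": 605},
    {"time": "Q1", "item": "I2", "location": "DEL", "rupees_sold": 825},
]

row = sales_fact[0]
print(item_dim[row["item"]]["item_name"], row["rupees_sold"])   # Keyboard 605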
Implementation of the Data Model
Working with a Multidimensional Data Model
The multidimensional data model is built by following a set of pre-decided stages.

The following stages should be followed by every project for building a multidimensional data model:

Stage 1: Assembling data from the client: In the first stage, correct data is collected from the client. Mostly, software professionals
make clear to the client the range of data that can be handled with the selected technology and collect the complete data in detail.

Stage 2: Grouping different segments of the system: In the second stage, all the data is recognized and classified into the respective
segments it belongs to, which also makes it problem-free to apply step by step.

Stage 3: Noticing the different dimensions: The third stage is the basis on which the design of the system rests. In this stage, the
main factors are recognized according to the user's point of view. These factors are also known as "dimensions".

Stage 4: Preparing the actual-time factors and their respective qualities: In the fourth stage, the factors recognized in the previous
step are used to identify their related qualities. These qualities are also known as "attributes" in the database.

Stage 5: Finding the facts for the factors listed previously and their qualities: In the fifth stage, the multidimensional data model
separates out the facts from the factors and qualities collected so far. These facts play a significant role in the arrangement of a
multidimensional data model.
•Consider the data of a shop for items sold per quarter in the city of Delhi. The data is shown in the table. In this 2D representation, the sales for Delhi are shown for the time dimension (organized in quarters) and the item dimension (classified according to the types of items sold). The fact or measure displayed is rupees_sold (in thousands).
Now suppose we want to view the sales data with a third dimension. For example, suppose the data is viewed according to time and item,
as well as location, for the cities Chennai, Kolkata, Mumbai, and Delhi. These 3D data are shown in the table, where the 3D data are
represented as a series of 2D tables.

Conceptually, the same data may also be represented in the form of a 3D data cube, as shown in the figure.
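
A rough pandas sketch of the same idea (the figures are made up): pivoting the fact data for one city gives a 2D table, and adding location as a further grouping key gives the series of 2D tables that make up the 3D cube.

# Building 2D and 3D views of sales data with pandas (illustrative figures only).
import pandas as pd

sales = pd.DataFrame({
    "time":        ["Q1", "Q2", "Q1", "Q2", "Q1", "Q2"],
    "item":        ["phone", "phone", "laptop", "laptop", "phone", "laptop"],
    "location":    ["Delhi", "Delhi", "Delhi", "Delhi", "Mumbai", "Mumbai"],
    "rupees_sold": [605, 680, 825, 952, 610, 784],
})

# 2D view: sales for Delhi organized by time and item.
delhi_2d = sales[sales.location == "Delhi"].pivot_table(
    index="time", columns="item", values="rupees_sold", aggfunc="sum")

# 3D view: adding location yields one 2D table per city, i.e. the data cube.
cube_3d = sales.pivot_table(
    index="time", columns=["location", "item"], values="rupees_sold", aggfunc="sum")

print(delhi_2d, cube_3d, sep="\n\n")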
Methods for Multidimensional Data Models
Several methods have been developed to work with multidimensional data models effectively:
OLAP (Online Analytical Processing): OLAP enables users to perform complex analytical queries and computations on
multidimensional data. It allows for interactive exploration of data from different perspectives and supports operations such as slice,
dice, roll-up, and drill-down.

Data Mining: Data mining techniques are used to discover patterns, associations, and insights from multidimensional datasets.
These techniques include clustering, classification, regression, and association rule mining, among others.

Multidimensional Databases: Multidimensional databases are designed specifically to store and retrieve multidimensional data
efficiently. They use specialized storage structures and query optimization techniques to provide fast access to aggregated data.

Dimensional Modeling: Dimensional modeling is a design technique used to structure data for OLAP and data warehouse
applications. It involves organizing data into fact tables (containing measures) and dimension tables (containing descriptive
attributes), facilitating fast query performance and intuitive data analysis.

Visualization Techniques: Visualization plays a crucial role in understanding and interpreting multidimensional data. Techniques
such as scatter plots, heatmaps, treemaps, and parallel coordinates are used to visually explore relationships and patterns in
multidimensional datasets.

Spatial and Temporal Analysis: In addition to traditional dimensions such as product, customer, and time, multidimensional data
models may include spatial and temporal dimensions. Spatial analysis techniques are used to analyze geographic data, while
temporal analysis techniques deal with time-series data.
•Data Cube Technology
•Data cube technology is a critical component in the field of data warehousing and data mining. Let's break down
each concept and how they relate:

•Data Warehousing: Data warehousing involves the process of collecting, storing, and managing large volumes of
data from various sources to support decision-making processes within an organization. Data warehouses are
centralized repositories that store structured, semi-structured, and unstructured data from different operational
sources.

•Data Cube Technology: A data cube is a multidimensional structure used to represent data in multiple
dimensions. It extends the concept of a two-dimensional table to three or more dimensions, enabling analysts to
explore data across various dimensions simultaneously. Each cell in the data cube represents the intersection of
dimensions and contains aggregated data. Data cube technology facilitates efficient querying and analysis of
multidimensional data sets.

•Data Mining: Data mining involves the process of discovering patterns, trends, and insights from large datasets. It
utilizes various techniques, including statistical analysis, machine learning, and pattern recognition, to extract
valuable knowledge from data. Data mining techniques are applied to uncover hidden patterns and relationships
within the data stored in data warehouses.

•The development of data cube technology has significantly contributed to the advancement of data warehousing
and data mining. By organizing data in a multidimensional format, data cubes enable analysts to perform complex
analyses and gain deeper insights into the underlying data. This structured approach to data representation enhances
the efficiency and effectiveness of data mining algorithms, allowing organizations to make informed decisions
based on the extracted knowledge.

•Overall, the integration of data cube technology into data warehousing and data mining processes has facilitated
the exploration, analysis, and interpretation of large and complex datasets, thereby empowering organizations to
derive actionable insights and drive business value from their data assets.
Data cube operations are used to manipulate data to meet the needs of users. These operations help to select particular data for
analysis. There are mainly five operations, listed below (a rough code sketch follows the list):

Roll-up: this operation aggregates data along a dimension, for example by climbing up a concept hierarchy. For example, if the data
cube displays the daily income of a customer, we can use a roll-up operation to find his monthly income.

Drill-down: this operation is the reverse of the roll-up operation. It takes summarized information and subdivides it further for
finer-granularity analysis; it zooms into more detail. For example, if India is a value in a country column and we wish to see
villages in India, the drill-down operation splits India into states, districts, towns, cities, and villages and then displays the
required information.

Slicing: this operation filters out the unnecessary portions. Suppose that in a particular dimension the user does not need everything
for analysis, but rather a particular value. For example, with country="Jamaica", the cube will display data only for Jamaica and
will not display the other countries present in the country list.

Dicing: this operation does a multidimensional cut: it does not cut only one dimension but can also go to other dimensions and cut
a certain range of them. As a result, it looks like a sub-cube cut out of the whole cube (as depicted in the figure). For example,
the user wants to see the annual salary of employees of Jharkhand state.

Pivot: this operation is important from a viewing point of view. It rotates the view of the data cube without changing the data
present in it. For example, if the user is comparing year versus branch, the pivot operation lets the user change the viewpoint
and compare branch versus item type instead.
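
The sketch below illustrates the five operations with pandas on made-up sales data; it is only a rough analogy, not a full OLAP engine.

# Rough pandas analogues of roll-up, drill-down, slice, dice, and pivot.
import pandas as pd

sales = pd.DataFrame({
    "year":    [2023, 2023, 2024, 2024],
    "month":   ["Jan", "Feb", "Jan", "Feb"],
    "country": ["Jamaica", "India", "Jamaica", "India"],
    "branch":  ["B1", "B2", "B1", "B2"],
    "amount":  [100, 150, 120, 170],
})

rollup    = sales.groupby("year")["amount"].sum()                       # month -> year
drilldown = sales.groupby(["year", "month"])["amount"].sum()            # back to month detail
slice_    = sales[sales.country == "Jamaica"]                           # fix one dimension
dice      = sales[(sales.country == "India") & (sales.year == 2024)]    # sub-cube over two dimensions
pivot     = sales.pivot_table(index="branch", columns="year",
                              values="amount", aggfunc="sum")           # rotate the view

print(rollup, drilldown, slice_, dice, pivot, sep="\n\n")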
•Advantages of data cubes:
•Multi-dimensional analysis: Data cubes enable multi-dimensional analysis of business data, allowing users to view data
from different perspectives and levels of detail.
•Interactivity: Data cubes provide interactive access to large amounts of data, allowing users to easily navigate and
manipulate the data to support their analysis.
•Speed and efficiency: Data cubes are optimized for OLAP analysis, enabling fast and efficient querying and aggregation
of data.
•Data aggregation: Data cubes support complex calculations and data aggregation, enabling users to quickly and easily
summarize large amounts of data.
•Improved decision-making: Data cubes provide a clear and comprehensive view of business data, enabling improved
decision-making and business intelligence.
•Accessibility: Data cubes can be accessed from a variety of devices and platforms, making it easy for users to access and
analyse business data from anywhere.
•Helps in giving a summarized view of data.
•Data cubes store large amounts of data in a simple way.
•Data cube operations provide quick and better analysis.
•Improved query performance.
•Disadvantages of data cube:
• Complexity: OLAP systems can be complex to set up and maintain, requiring specialized technical expertise.
•Data size limitations: OLAP systems can struggle with very large data sets and may require extensive data
aggregation or summarization.
•Performance issues: OLAP systems can be slow when dealing with large amounts of data, especially when
running complex queries or calculations.
•Data integrity: Inconsistent data definitions and data quality issues can affect the accuracy of OLAP analysis.
•Cost: OLAP technology can be expensive, especially for enterprise-level solutions, due to the need for specialized
hardware and software.
•Inflexibility: OLAP systems may not easily accommodate changing business needs and may require significant
effort to modify or extend.
Multi-Way Array Aggregation and Bottom-Up Computation (BUC)
•Multi-way array aggregation and the bottom-up computation (BUC) algorithm are techniques used in data mining and analytics, particularly for processing multi-dimensional datasets efficiently.

•Multi-way Array Aggregation: This technique involves aggregating data across multiple dimensions simultaneously. Imagine you have a dataset with several dimensions (e.g., time, location, product category), and you want to compute aggregates like sums, averages, or counts across various combinations of these dimensions. Multi-way array aggregation allows you to perform these computations efficiently by leveraging the inherent structure of the data.
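
When the cube is small enough to hold as a dense array, the essence of the idea can be seen with NumPy: each aggregate is a sum of the base array along one or more axes. This is only a conceptual sketch; the actual multi-way algorithm works chunk by chunk and shares partial sums, which the snippet does not show.

# Conceptual sketch: aggregating a dense 3D array (time x location x product).
import numpy as np

cube = np.arange(2 * 3 * 4).reshape(2, 3, 4)   # base cuboid with illustrative values

by_time_location = cube.sum(axis=2)       # aggregate away the product dimension
by_time          = cube.sum(axis=(1, 2))  # aggregate away location and product
grand_total      = cube.sum()             # apex cuboid: everything aggregated

print(by_time_location.shape, by_time.shape, grand_total)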

•Bottom-Up Computation (BUC) Algorithm: BUC is a method for computing multi-dimensional aggregates in a bottom-up manner. It is particularly useful for OLAP (Online Analytical Processing) tasks where you need to compute aggregations on multi-dimensional data cubes efficiently. The algorithm works by recursively partitioning the dataset into smaller sub-cubes and computing aggregates for each sub-cube. These aggregates are then combined to obtain aggregates for larger cubes, eventually leading to the computation of aggregates for the entire dataset.
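
A very small Python sketch of the bottom-up idea follows (the dimension names and rows are invented, the only aggregate is a count, and min_support plays the role of iceberg pruning):

# Minimal BUC-style sketch: recursively partition the rows on one dimension at a
# time, record the count for each group-by, and prune partitions that are too small.
from collections import defaultdict

def buc(rows, dims, prefix=(), min_support=1, results=None):
    if results is None:
        results = {}
    results[prefix] = len(rows)                  # aggregate for the current group-by
    for i, dim in enumerate(dims):
        partitions = defaultdict(list)
        for row in rows:
            partitions[row[dim]].append(row)
        for value, part in partitions.items():
            if len(part) >= min_support:         # iceberg pruning
                buc(part, dims[i + 1:], prefix + ((dim, value),), min_support, results)
    return results

rows = [{"city": "Delhi", "item": "phone"},
        {"city": "Delhi", "item": "laptop"},
        {"city": "Mumbai", "item": "phone"}]
for group_by, count in buc(rows, ["city", "item"]).items():
    print(group_by, count)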
Star Cubing/Schema
Star-Cubing is a data mining technique used for OLAP (Online Analytical Processing) analysis. It's particularly efficient for
generating summaries or aggregations of data in a data warehouse or a multidimensional database.

The term "Star" in Star-Cubing refers to the star schema, a common design for data warehouses where a fact table (containing
the primary data) is surrounded by dimension tables (containing the context or descriptors for the data). Cubing refers to
creating multi-dimensional cubes for analysis.

Here's how Star-Cubing typically works:

Selection of Cube Dimensions: The dimensions along which the data needs to be analyzed are selected. These dimensions can
be time, geography, product category, etc.

Preprocessing: The data is preprocessed to create a cube. This involves grouping data based on combinations of dimensions
and precomputing aggregated values for each group.

Cube Construction: The cube is constructed based on the dimensions selected. It's essentially a multi-dimensional array where
each cell represents the intersection of values from different dimensions, and the value in the cell is the aggregated result (like
sum, count, average, etc.) of the data points falling into that intersection.

Cube Pruning: This step involves reducing the size of the cube by removing redundant or less significant data points. This is
crucial for improving query performance and reducing memory requirements.
Cube Storage and Indexing: The cube is stored in a data structure optimized for fast retrieval, often with indexing
mechanisms to quickly access relevant data points.

Querying: Once the cube is constructed and stored, users can query it to obtain summarized views of the data along different
dimensions. These queries are typically much faster than querying the original raw data, as the cube contains precomputed
aggregations.

Star-Cubing is especially useful for OLAP operations because it allows for quick analysis of large volumes of data across
multiple dimensions. It is commonly used in business intelligence applications for tasks like sales analysis, customer
segmentation, and trend identification.
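
As a rough illustration of querying a star schema (table contents invented), the fact table is joined to its dimension tables and then aggregated; this is the simple join logic referred to in the advantages below.

# Star-schema style query sketch: join the fact table to its dimensions, then aggregate.
import pandas as pd

dim_item = pd.DataFrame({"item_key": [1, 2],
                         "item_name": ["phone", "laptop"],
                         "brand": ["A", "B"]})
dim_time = pd.DataFrame({"time_key": [1, 2],
                         "quarter": ["Q1", "Q2"],
                         "year": [2024, 2024]})
fact_sales = pd.DataFrame({"item_key": [1, 1, 2, 2],
                           "time_key": [1, 2, 1, 2],
                           "rupees_sold": [605, 680, 825, 952]})

result = (fact_sales
          .merge(dim_item, on="item_key")
          .merge(dim_time, on="time_key")
          .groupby(["quarter", "item_name"])["rupees_sold"].sum())
print(result)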
•Advantages of Star Schema :
•Simpler Queries –
The join logic of a star schema is quite simple in comparison to the join logic needed to fetch data from a highly normalized
transactional schema.
•Simplified Business Reporting Logic –
In comparison to a highly normalized transactional schema, the star schema simplifies common business reporting logic, such as
as-of reporting and period-over-period comparisons.
•Feeding Cubes –
The star schema is widely used by OLAP systems to design OLAP cubes efficiently. In fact, major OLAP systems deliver a ROLAP
mode of operation which can use a star schema as a source without designing a cube structure.
•Disadvantages of Star Schema –
•Data integrity is not enforced well, since the schema is in a highly de-normalized state.
•Not as flexible in terms of analytical needs as a normalized data model.
•Star schemas don't support many-to-many relationships between business entities, at least not frequently.
•What is OLAP?
•OLAP stands for Online Analytical Processing, which is a technology that enables multi-dimensional analysis of business data. It provides interactive access to large amounts of data and supports complex calculations and data aggregation. OLAP is used to support business intelligence and decision-making processes.

•A grouping of data in a multidimensional matrix is called a data cube. In data warehousing, we generally deal with multidimensional data models, as the data is represented by multiple dimensions and multiple attributes. This multidimensional data is represented in the data cube, as the cube represents a high-dimensional space. The data cube pictorially shows how different attributes of data are arranged in the data model. Below is the diagram of a general data cube.
Introduction to OLAP and High-Dimensional OLAP Methods

High-Dimensional OLAP (HDOLAP) extends the capabilities of traditional OLAP by addressing the challenges posed by datasets
with a large number of dimensions. In traditional OLAP, querying and analyzing data become increasingly complex and
computationally intensive as the number of dimensions grows. HDOLAP methods are specifically designed to handle datasets
with high dimensionality efficiently.

Here's a brief introduction to both OLAP and HDOLAP:


OLAP:
OLAP systems organize data into multidimensional structures called cubes. These cubes typically consist of dimensions
(attributes or categories along which the data is analyzed) and measures (numeric data that is analyzed).
Users can navigate through different levels of aggregation, drill down into details, pivot dimensions, and slice and dice data to
analyze it from various perspectives.
OLAP operations include roll-up (aggregating data from lower levels to higher levels), drill-down (breaking down aggregated
data into more detailed levels), slice (selecting a subset of data based on one or more dimensions), and dice (selecting a subset of
data based on multiple dimensions).
OLAP systems are characterized by their ability to provide fast query response times, even with large datasets and complex
queries.
High-Dimensional OLAP (HDOLAP):
HDOLAP methods address the challenges posed by datasets with a large number of dimensions, which can lead to the "curse
of dimensionality" in traditional OLAP systems.
Traditional OLAP systems may struggle to efficiently handle high-dimensional data due to the exponential increase in the size
of the data cube as the number of dimensions grows.
HDOLAP methods employ techniques such as dimension reduction, sparsity exploitation, and specialized indexing structures
to optimize storage and query processing for high-dimensional datasets.
These methods aim to preserve the analytical capabilities of traditional OLAP while addressing the scalability and
performance issues associated with high-dimensional data.
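
One simple way to exploit sparsity in a high-dimensional cube is to store only the non-empty cells, keyed by their coordinate tuples, instead of allocating a dense array; the sketch below (with invented dimensions and values) shows the idea.

# Sparse-cube sketch: keep only populated cells, keyed by their coordinates.
from collections import defaultdict

cells = defaultdict(float)   # (time, location, product, channel) -> measure

def add_fact(coords, value):
    cells[coords] += value

add_fact(("Q1", "Delhi", "phone", "online"), 605.0)
add_fact(("Q1", "Mumbai", "laptop", "store"), 825.0)

# A dense array over these four dimensions would be mostly zeros; the dictionary
# grows only with the number of populated cells.
print(len(cells), cells[("Q1", "Delhi", "phone", "online")])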
Further Development of Data Cube and OLAP Methods

Further development in the field of data cube and OLAP methods involves advancements aimed at enhancing their capabilities,
scalability, and performance to meet the evolving needs of data analysis and decision support. Here are some areas of further
development:

Advanced Query Optimization: Improving query processing and optimization techniques to handle complex analytical queries
more efficiently, particularly in scenarios involving large datasets and high-dimensional data cubes. This may include the
development of innovative algorithms for query rewriting, indexing structures, and caching mechanisms.

Real-Time OLAP: Integrating real-time data processing capabilities into OLAP systems to support analysis of streaming data and
enable timely decision-making. This involves developing techniques for ingesting, processing, and analysing data streams in near
real-time, while still providing interactive and ad-hoc query capabilities.

In-Memory OLAP: Expanding the use of in-memory computing technologies to accelerate data cube processing and analysis. In-
memory OLAP systems store data entirely in RAM, reducing disk I/O latency and enabling faster query response times. Further
development in this area involves optimizing memory utilization, data compression techniques, and memory management
strategies to handle large datasets.

Parallel and Distributed OLAP: Leveraging parallel and distributed computing architectures to scale OLAP systems horizontally
and handle massive volumes of data. This includes developing parallel query processing algorithms, distributed storage
mechanisms, and coordination protocols for distributed OLAP processing across multiple nodes or clusters.
Advanced Analytics Integration: Integrating advanced analytics and machine learning techniques directly into OLAP systems
to enable more sophisticated analysis and predictive modeling. This involves providing built-in support for statistical analysis,
data mining algorithms, predictive modeling, and anomaly detection within OLAP environments.

Semantic OLAP: Enhancing OLAP systems with semantic capabilities to enable more intuitive and context-aware data analysis.
Semantic OLAP involves incorporating semantic metadata, ontologies, and knowledge graphs into OLAP models to enrich data
semantics, support natural language querying, and enable more intelligent query understanding and interpretation.

User Interface and Visualization Enhancements: Improving the user interface and visualization capabilities of OLAP tools to
enhance user experience and facilitate exploratory data analysis. This includes developing interactive and intuitive visualization
techniques, support for dynamic dashboards, and integration with modern BI and data visualization platforms.

Cloud-Based OLAP Solutions: Developing cloud-native OLAP solutions that leverage the scalability, elasticity, and cost-
effectiveness of cloud computing platforms. Cloud-based OLAP offerings enable organizations to deploy and manage OLAP
systems as scalable services, with features such as auto-scaling, pay-as-you-go pricing models, and seamless integration with
other cloud services.
How Data Cubes and OLAP are related to each other

A data cube and OLAP (Online Analytical Processing) are closely related concepts in the realm of data analysis and business
intelligence.

Data Cube:

A data cube is a multi-dimensional representation of data used to analyze and visualize complex datasets. It organizes data into
a structure with multiple dimensions, such as time, geography, product, and so on.
Each dimension represents a different attribute or perspective of the data, and the intersections of these dimensions form cells
in the cube, each containing aggregated or summarized data.
Data cubes allow for efficient querying, slicing, dicing, and drilling down into the data to gain insights from various angles.
OLAP (Online Analytical Processing):

OLAP refers to a category of technologies and tools used for analyzing and manipulating multidimensional data from multiple
perspectives.
OLAP enables users to perform complex analyses interactively, including slicing and dicing data, drilling down into details,
rolling up summary data, and pivoting dimensions.
OLAP systems typically operate on data cubes or similar multidimensional data structures, providing a fast and flexible
environment for decision support and business intelligence.
Relationship:
Data cubes are often used as the underlying data structure in OLAP systems. OLAP systems leverage the multidimensional
nature of data cubes to provide users with interactive and intuitive interfaces for analyzing data.
In essence, data cubes provide the structure and organization necessary for OLAP operations. OLAP tools interact with data
cubes to facilitate analytical tasks, such as ad-hoc querying, reporting, and data visualization, making it easier for users to
explore and understand complex datasets.

In summary, data cubes provide the multidimensional data model, while OLAP systems leverage this model to provide
analytical capabilities for interactive data analysis. Together, they enable users to analyze large volumes of data from different
perspectives and gain valuable insights for decision-making.
•Attribute-Oriented Induction

•Attribute-Oriented Induction (AOI) is a machine learning technique used for knowledge discovery and data mining. It is particularly useful for discovering patterns and relationships within
datasets characterized by attributes or features.

•Here's how Attribute-Oriented Induction works:

•Attribute Selection: AOI begins by selecting a target attribute or set of attributes that represent the outcome or class of interest. These attributes are typically
categorical or discrete in nature.

•Attribute Construction: Next, AOI constructs attribute concepts or patterns based on the selected target attributes. These attribute concepts represent
patterns of attributes that are associated with specific classes or outcomes.

•Rule Generation: AOI generates rules or patterns that describe relationships between attribute concepts and the target classes. These rules are often in the
form of "if-then" statements, where the antecedent specifies conditions on attribute values and the consequent specifies the predicted class or outcome.

•Rule Evaluation: The generated rules are evaluated based on various criteria such as accuracy, coverage, and simplicity. Rules that meet predefined criteria
are retained as potential knowledge discovered from the dataset.

•Rule Refinement: AOI may involve iterative refinement of rules to improve their quality and generalization. This may include pruning rules to eliminate
redundancy or overfitting, as well as refining conditions to increase their predictive power.

•Rule Application: Finally, the discovered rules can be applied to new instances or datasets to predict their class labels or outcomes based on their attribute
values.

Attribute-Oriented Induction is commonly used in areas such as data mining, knowledge discovery, and decision support systems. It's particularly effective for discovering
interpretable patterns and rules from datasets with categorical or discrete attributes, making it valuable for tasks such as classification, pattern recognition, and rule-based
reasoning.
Attribute-Oriented Induction
The Attribute-Oriented Induction (AOI) approach to data generalization and summarization-based
characterization was first proposed in 1989 (KDD '89 workshop), a few years before the
introduction of the data cube approach.

The data cube approach can be considered a data warehouse-based, precomputation-oriented,
materialized approach.

It performs off-line aggregation before an OLAP or data mining query is submitted for processing.

On the other hand, the attribute-oriented induction approach, at least in its initial proposal, is a
relational database query-oriented, generalization-based, on-line data analysis technique.

However, there is no inherent barrier distinguishing the two approaches based on online
aggregation versus offline precomputation.
Some aggregations in the data cube can be computed on-line, while off-line precomputation of
multidimensional space can speed up attribute-oriented induction as well.

It was proposed in 1989 (KDD '89 workshop).

It is not confined to categorical data, nor to particular measures.


How is it done?
Collect the task-relevant data (the initial relation) using a relational database query.
Perform generalization by attribute removal or attribute generalization.

Apply aggregation by merging identical, generalized tuples and accumulating their respective counts.
This reduces the size of the generalized data set.
Present the result interactively to users.

Basic Principles of Attribute-Oriented Induction

Data focusing:
Analyze task-relevant data, including dimensions; the result is the initial relation.

Attribute removal:
Remove attribute A if there is a large set of distinct values for A but (1) there is no generalization operator on A, or (2) A's
higher-level concepts are expressed in terms of other attributes.

Attribute generalization:
If there is a large set of distinct values for A, and there exists a set of generalization operators on A, then select an operator
and generalize A.
Attribute-threshold control:
The number of distinct values allowed per generalized attribute; typically 2-8, specified by the user or set by default.

Generalized relation threshold control (typically 10-30):
Controls the final relation/rule size.
Algorithm for Attribute-Oriented Induction
InitialRel:
Query processing of the task-relevant data, deriving the initial relation.

PreGen:
Based on the analysis of the number of distinct values in each attribute, determine the generalization plan for each attribute:
removal, or how high to generalize.

PrimeGen:
Based on the PreGen plan, perform the generalization to the right level to derive the "prime generalized relation",
accumulating the counts.

Presentation:
User interaction: (1) adjust levels by drilling, (2) pivoting, (3) mapping into rules, cross tabs, or visualization presentations.

Example
Let's say there is a University database that is to be characterized. Its corresponding DMQL query will be:

use University_DB
mine characteristics as “Science_Students”
in relevance to name, gender, major, birth_place, birth_date, residence, phone_no, GPA
from student
Its corresponding SQL statement can be:

Select name, gender, major, birth_place, birth_date, residence, phone_no, GPA
from student
where status in {"Msc", "MBA", "Ph.D."}

Now for this database let's create a characterized view:

InitialRel:
From this table, we query the task-relevant data. We also remove a few attributes, such as name and phone_no, because they make
no sense for drawing insights.

PreGen:
Now we generalize these results by removing a few attributes and retaining the important ones.

We also generalize a few attributes, renaming them "Country" rather than "Birth_Place", "Age Range" rather than "Birth_date",
"City" rather than "Residence", and so on, as per the table given below.
