
Unit 2

Data Warehouse
What is a Data Warehouse?

A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of
data in support of management's decision-making process.
Subject-oriented
The data in the warehouse is organized around subjects or topics rather than the applications or source systems that
generate the data. For example, all sales data, regardless of where it comes from, is logically organized and grouped in the
data warehouse. Assembling data around subject areas allows for a unified view of the subject—in our case, sales—rather
than disparate views from each system.
Integrated
The data from each source system (e.g. CRM, ERP, Behavioral Data, or e-commerce platforms) is brought together and
made consistent in the data warehouse. For instance, if one system uses region names like “North Carolina” and another
uses abbreviations such as “NC,” an integrated data warehouse would reconcile the region names to produce a consistent
coding system.
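
As a small illustration of the integration step, the following Python sketch reconciles region names from two source systems into one coding scheme (the source records and the mapping table are invented for the example):

# Minimal integration sketch: map every spelling of a region onto one code.
# The source rows and the mapping table below are illustrative only.
STATE_TO_CODE = {"North Carolina": "NC", "New York": "NY", "California": "CA"}

crm_rows = [{"customer": "Acme", "region": "North Carolina"}]   # CRM spells regions out
erp_rows = [{"customer": "Acme", "region": "NC"}]               # ERP already abbreviates

def normalize_region(value):
    return STATE_TO_CODE.get(value, value)

integrated = [{**row, "region": normalize_region(row["region"])} for row in crm_rows + erp_rows]
print(integrated)   # both records now carry region "NC"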
Time-variant
Data in the warehouse is maintained over time, allowing for trend analysis, forecasting, AI/ML, and historical reporting.
It's not just a snapshot of the current moment but a timeline of data that allows businesses to see changes and
developments over time.
Non-volatile
Data written into the warehouse doesn't ever get overwritten or deleted, ensuring the stability and reliability of the data,
which is crucial for trustworthy analysis. Data can be added to the warehouse, but existing data typically remains
unchanged.
Architecture of Data Warehouse:
1. Data Sources:
Data warehouses collect data from multiple sources, including operational databases, CRM systems, ERP systems,
spreadsheets, flat files, and external sources such as social media and web services.

2. ETL (Extract, Transform, Load) Process:


The ETL process involves extracting data from source systems, transforming it into a consistent format suitable for analysis,
and loading it into the data warehouse. This process often includes data cleansing, data validation, data transformation, and
integration of disparate data sources.
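
A highly simplified ETL pass could look like the Python sketch below. This is only an illustration: the CSV file, column names, and SQLite table are hypothetical, and a real pipeline would add far more validation.

# Minimal ETL sketch: extract rows from a CSV file, transform them, load into SQLite.
import csv
import sqlite3

def extract(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Cleanse and standardize: drop rows with no amount, trim and upper-case the
    # region code, and cast the amount to a number.
    cleaned = []
    for row in rows:
        if not row.get("amount"):
            continue
        cleaned.append({"region": row["region"].strip().upper(),
                        "amount": float(row["amount"])})
    return cleaned

def load(rows, db_path="warehouse.db"):
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS sales_fact (region TEXT, amount REAL)")
    con.executemany("INSERT INTO sales_fact VALUES (:region, :amount)", rows)
    con.commit()
    con.close()

load(transform(extract("sales.csv")))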

3. Staging Area:
The staging area is an intermediate storage area where data extracted from source systems is temporarily stored before it
undergoes transformation and loading into the data warehouse. It helps in data validation, debugging, and maintaining data
integrity during the ETL process.

4. Data Warehouse Database:


The data warehouse database is the central repository where transformed and integrated data from various sources are stored. It
typically uses a relational database management system (RDBMS) optimized for analytical queries and reporting. The database
schema often follows a dimensional modelling approach, such as star schema or snowflake schema, consisting of fact tables and
dimension tables.
5. Data Marts:
Data marts are subsets of the data warehouse that are tailored for specific business units, departments, or user groups within an
organization. They contain pre-aggregated and summarized data relevant to the particular needs of users, enabling faster query
performance and easier access to relevant information.
Data Warehouse Architecture
A data warehouse architecture defines the overall framework for data communication, processing, and presentation that exists for
end-user computing within the enterprise. Each data warehouse is different, but all are characterized by standard vital components.
Production applications such as payroll, accounts payable, product purchasing, and inventory control are designed for online
transaction processing (OLTP). Such applications gather detailed data from day-to-day operations.
Data warehouse applications are designed to support users' ad-hoc data requirements, an activity recently dubbed online
analytical processing (OLAP). These include applications such as forecasting, profiling, summary reporting, and trend analysis.
Production databases are updated continuously, either by hand or via OLTP applications. In contrast, a warehouse database
is updated from operational systems periodically, usually during off-hours. As OLTP data accumulates in production
databases, it is regularly extracted, filtered, and then loaded into a dedicated warehouse server that is accessible to users. As
the warehouse is populated, it must be restructured: tables are de-normalized, data is cleansed of errors and redundancies, and
new fields and keys are added to reflect the needs of users for sorting, combining, and summarizing data.

Data warehouses and their architectures vary depending upon the specifics of an organization's situation.
Three common architectures are:
•Data Warehouse Architecture: Basic
•Data Warehouse Architecture: With Staging Area
•Data Warehouse Architecture: With Staging Area and Data Marts
Data Warehouse Architecture: Basic
Operational System
An operational system is a method used in data warehousing to refer to a system that is used to process the day-to-day
transactions of an organization.
Flat Files
A Flat file system is a system of files in which transactional data is stored, and every file in the system must have a different
name.
Meta Data
A set of data that defines and gives information about other data.
Metadata is used in a data warehouse for a variety of purposes, including:
Metadata summarizes necessary information about data, which can make finding and working with particular instances of data
easier. For example, author, date created, date modified, and file size are examples of very basic document metadata.
Metadata is also used to direct a query to the most appropriate data source.
Lightly and highly summarized data
This area of the data warehouse stores all the predefined lightly and highly summarized (aggregated) data generated by the
warehouse manager.
The goal of the summarized data is to speed up query performance. The summarized data is updated continuously as new
information is loaded into the warehouse.
End-User Access Tools
The principal purpose of a data warehouse is to provide information to business managers for strategic decision-making. These
users interact with the warehouse using end-client access tools.
The examples of some of the end-user access tools can be:
•Reporting and Query Tools
•Application Development Tools
•Executive Information Systems Tools
•Data Warehouse Architecture: With Staging Area
•We must clean and process operational data before putting it into the warehouse.
•We can do this programmatically, although most data warehouses use a staging area (a place where data is processed before entering the warehouse).
•A staging area simplifies data cleansing and consolidation of operational data coming from multiple source systems, especially for enterprise data warehouses, where all relevant data of an enterprise is consolidated.
A data warehouse staging area is a temporary location where records from source systems are copied.
•Data Warehouse Architecture: With Staging Area and Data Marts
•We may want to customize our warehouse's architecture for multiple groups within our organization.
•We can do this by adding data marts. A data mart is a segment of a data warehouse that provides information for reporting and analysis on a section, unit, department, or operation of the company, e.g., sales, payroll, production, etc.
•The figure illustrates an example where purchasing, sales, and stocks are separated. In this example, a financial analyst wants to analyze historical data for purchases and sales, or mine historical information to make predictions about customer behavior.
•What is a Multi-Dimensional Data Model?
•A multidimensional model views data in the form of a data cube. A data cube enables data to be modeled and viewed in multiple dimensions. It is defined by dimensions and facts.
•The dimensions are the perspectives or entities concerning which an organization keeps records. For example, a shop may create a sales data warehouse to keep records of the store's sales with respect to the dimensions time, item, and location. These dimensions allow the store to keep track of things such as monthly sales of items and the locations at which the items were sold. Each dimension has a table related to it, called a dimension table, which describes the dimension further. For example, a dimension table for item may contain the attributes item_name, brand, and type.
•A multidimensional data model is organized around a central theme, for example, sales. This theme is represented by a fact table. Facts are numerical measures. The fact table contains the names of the facts, or measures, along with keys to the related dimension tables.
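
To make the idea concrete, here is a minimal Python sketch of a fact table and its dimension tables (all keys, names, and figures are invented): each fact row references the dimension tables by key and carries the numeric measure.

# Illustrative dimension tables and a fact table for a small sales model.
item_dim = {
    "I1": {"item_name": "Keyboard", "brand": "BrandA", "type": "accessory"},
    "I2": {"item_name": "Monitor",  "brand": "BrandB", "type": "display"},
}
time_dim = {"Q1": {"quarter": "Q1", "year": 2024}}
location_dim = {"DEL": {"city": "Delhi"}}

# rupees_sold is the fact (numeric measure); the other columns are dimension keys.
sales_fact = [
    {"time": "Q1", "item": "I1", "location": "DEL", "rupees_sold": 605},
    {"time": "Q1", "item": "I2", "location": "DEL", "rupees_sold": 825},
]

row = sales_fact[0]
print(item_dim[row["item"]]["item_name"], row["rupees_sold"])   # Keyboard 605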
Implementation of the Data Model
Working with a Multidimensional Data Model
The multidimensional data model is built by following a set of pre-decided stages.

The following stages should be followed by every project for building a multidimensional data model:

Stage 1: Assembling data from the client: In the first stage, correct data is collected from the client. Mostly, software professionals
make clear to the client the range of data that can be handled with the selected technology and collect the complete data in detail.

Stage 2: Grouping different segments of the system: In the second stage, all the data is recognized and classified into the respective
segments it belongs to, which also makes it problem-free to apply step by step.

Stage 3: Noticing the different dimensions: The third stage is the basis on which the design of the system rests. In this stage, the
main factors are recognized according to the user's point of view. These factors are also known as "dimensions".

Stage 4: Preparing the actual-time factors and their respective qualities: In the fourth stage, the factors recognized in the previous
step are used to identify their related qualities. These qualities are also known as "attributes" in the database.

Stage 5: Finding the facts for the factors listed previously and their qualities: In the fifth stage, the multidimensional data model
separates out the facts from the factors and qualities collected so far. These facts play a significant role in the arrangement of a
multidimensional data model.
•Consider the data of a shop for items sold per quarter in the city of Delhi. The data is shown in the table. In this 2D representation, the sales for Delhi are shown for the time dimension (organized in quarters) and the item dimension (classified according to the types of items sold). The fact or measure displayed is rupees_sold (in thousands).
Now suppose we want to view the sales data with a third dimension. For example, suppose the data is viewed according to time and item,
as well as location, for the cities Chennai, Kolkata, Mumbai, and Delhi. These 3D data are shown in the table, where the 3D data are
represented as a series of 2D tables.

Conceptually, the same data may also be represented in the form of a 3D data cube, as shown in the figure.
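
A rough pandas sketch of the same idea (the figures are made up): pivoting the fact data for one city gives a 2D table, and adding location as a further grouping key gives the series of 2D tables that make up the 3D cube.

# Building 2D and 3D views of sales data with pandas (illustrative figures only).
import pandas as pd

sales = pd.DataFrame({
    "time":        ["Q1", "Q2", "Q1", "Q2", "Q1", "Q2"],
    "item":        ["phone", "phone", "laptop", "laptop", "phone", "laptop"],
    "location":    ["Delhi", "Delhi", "Delhi", "Delhi", "Mumbai", "Mumbai"],
    "rupees_sold": [605, 680, 825, 952, 610, 784],
})

# 2D view: sales for Delhi organized by time and item.
delhi_2d = sales[sales.location == "Delhi"].pivot_table(
    index="time", columns="item", values="rupees_sold", aggfunc="sum")

# 3D view: adding location yields one 2D table per city, i.e. the data cube.
cube_3d = sales.pivot_table(
    index="time", columns=["location", "item"], values="rupees_sold", aggfunc="sum")

print(delhi_2d, cube_3d, sep="\n\n")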
Methods for Multidimensional Data Models
Several methods have been developed to work with multidimensional data models effectively:
OLAP (Online Analytical Processing): OLAP enables users to perform complex analytical queries and computations on
multidimensional data. It allows for interactive exploration of data from different perspectives and supports operations such as slice,
dice, roll-up, and drill-down.

Data Mining: Data mining techniques are used to discover patterns, associations, and insights from multidimensional datasets.
These techniques include clustering, classification, regression, and association rule mining, among others.

Multidimensional Databases: Multidimensional databases are designed specifically to store and retrieve multidimensional data
efficiently. They use specialized storage structures and query optimization techniques to provide fast access to aggregated data.

Dimensional Modeling: Dimensional modeling is a design technique used to structure data for OLAP and data warehouse
applications. It involves organizing data into fact tables (containing measures) and dimension tables (containing descriptive
attributes), facilitating fast query performance and intuitive data analysis.

Visualization Techniques: Visualization plays a crucial role in understanding and interpreting multidimensional data. Techniques
such as scatter plots, heatmaps, treemaps, and parallel coordinates are used to visually explore relationships and patterns in
multidimensional datasets.

Spatial and Temporal Analysis: In addition to traditional dimensions such as product, customer, and time, multidimensional data
models may include spatial and temporal dimensions. Spatial analysis techniques are used to analyze geographic data, while
temporal analysis techniques deal with time-series data.
•Data Cube Technology
•Data cube technology is a critical component in the field of data warehousing and data mining. Let's break down
each concept and how they relate:

•Data Warehousing: Data warehousing involves the process of collecting, storing, and managing large volumes of
data from various sources to support decision-making processes within an organization. Data warehouses are
centralized repositories that store structured, semi-structured, and unstructured data from different operational
sources.

•Data Cube Technology: A data cube is a multidimensional structure used to represent data in multiple
dimensions. It extends the concept of a two-dimensional table to three or more dimensions, enabling analysts to
explore data across various dimensions simultaneously. Each cell in the data cube represents the intersection of
dimensions and contains aggregated data. Data cube technology facilitates efficient querying and analysis of
multidimensional data sets.

•Data Mining: Data mining involves the process of discovering patterns, trends, and insights from large datasets. It
utilizes various techniques, including statistical analysis, machine learning, and pattern recognition, to extract
valuable knowledge from data. Data mining techniques are applied to uncover hidden patterns and relationships
within the data stored in data warehouses.

•The development of data cube technology has significantly contributed to the advancement of data warehousing
and data mining. By organizing data in a multidimensional format, data cubes enable analysts to perform complex
analyses and gain deeper insights into the underlying data. This structured approach to data representation enhances
the efficiency and effectiveness of data mining algorithms, allowing organizations to make informed decisions
based on the extracted knowledge.

•Overall, the integration of data cube technology into data warehousing and data mining processes has facilitated
the exploration, analysis, and interpretation of large and complex datasets, thereby empowering organizations to
derive actionable insights and drive business value from their data assets.
Data cube operations are used to manipulate data to meet the needs of users. These operations help to select particular data for
analysis. There are mainly five operations, listed below (a rough code sketch follows the list):

Roll-up: this operation aggregates data along a dimension, for example by climbing up a concept hierarchy. For example, if the data
cube displays the daily income of a customer, we can use a roll-up operation to find his monthly income.

Drill-down: this operation is the reverse of the roll-up operation. It takes summarized information and subdivides it further for
finer-granularity analysis; it zooms into more detail. For example, if India is a value in a country column and we wish to see
villages in India, the drill-down operation splits India into states, districts, towns, cities, and villages and then displays the
required information.

Slicing: this operation filters out the unnecessary portions. Suppose that in a particular dimension the user does not need everything
for analysis, but rather a particular value. For example, with country="Jamaica", the cube will display data only for Jamaica and
will not display the other countries present in the country list.

Dicing: this operation does a multidimensional cut: it does not cut only one dimension but can also go to other dimensions and cut
a certain range of them. As a result, it looks like a sub-cube cut out of the whole cube (as depicted in the figure). For example,
the user wants to see the annual salary of employees of Jharkhand state.

Pivot: this operation is important from a viewing point of view. It rotates the view of the data cube without changing the data
present in it. For example, if the user is comparing year versus branch, the pivot operation lets the user change the viewpoint
and compare branch versus item type instead.
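
The sketch below illustrates the five operations with pandas on made-up sales data; it is only a rough analogy, not a full OLAP engine.

# Rough pandas analogues of roll-up, drill-down, slice, dice, and pivot.
import pandas as pd

sales = pd.DataFrame({
    "year":    [2023, 2023, 2024, 2024],
    "month":   ["Jan", "Feb", "Jan", "Feb"],
    "country": ["Jamaica", "India", "Jamaica", "India"],
    "branch":  ["B1", "B2", "B1", "B2"],
    "amount":  [100, 150, 120, 170],
})

rollup    = sales.groupby("year")["amount"].sum()                       # month -> year
drilldown = sales.groupby(["year", "month"])["amount"].sum()            # back to month detail
slice_    = sales[sales.country == "Jamaica"]                           # fix one dimension
dice      = sales[(sales.country == "India") & (sales.year == 2024)]    # sub-cube over two dimensions
pivot     = sales.pivot_table(index="branch", columns="year",
                              values="amount", aggfunc="sum")           # rotate the view

print(rollup, drilldown, slice_, dice, pivot, sep="\n\n")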
•Advantages of data cubes:
•Multi-dimensional analysis: Data cubes enable multi-dimensional analysis of business data, allowing users to view data
from different perspectives and levels of detail.
•Interactivity: Data cubes provide interactive access to large amounts of data, allowing users to easily navigate and
manipulate the data to support their analysis.
•Speed and efficiency: Data cubes are optimized for OLAP analysis, enabling fast and efficient querying and aggregation
of data.
•Data aggregation: Data cubes support complex calculations and data aggregation, enabling users to quickly and easily
summarize large amounts of data.
•Improved decision-making: Data cubes provide a clear and comprehensive view of business data, enabling improved
decision-making and business intelligence.
•Accessibility: Data cubes can be accessed from a variety of devices and platforms, making it easy for users to access and
analyse business data from anywhere.
•Helps in giving a summarized view of data.
•Data cubes store large amounts of data in a simple way.
•Data cube operations provide quick and better analysis.
•Improved query performance.
•Disadvantages of data cube:
• Complexity: OLAP systems can be complex to set up and maintain, requiring specialized technical expertise.
•Data size limitations: OLAP systems can struggle with very large data sets and may require extensive data
aggregation or summarization.
•Performance issues: OLAP systems can be slow when dealing with large amounts of data, especially when
running complex queries or calculations.
•Data integrity: Inconsistent data definitions and data quality issues can affect the accuracy of OLAP analysis.
•Cost: OLAP technology can be expensive, especially for enterprise-level solutions, due to the need for specialized
hardware and software.
•Inflexibility: OLAP systems may not easily accommodate changing business needs and may require significant
effort to modify or extend.
Multi-Way Array Aggregation and Bottom-Up Computation (BUC)
•Multi-way array aggregation and the bottom-up computation (BUC) algorithm are techniques used in data mining and analytics, particularly for processing multi-dimensional datasets efficiently.

•Multi-way Array Aggregation: This technique involves aggregating data across multiple dimensions simultaneously. Imagine you have a dataset with several dimensions (e.g., time, location, product category), and you want to compute aggregates like sums, averages, or counts across various combinations of these dimensions. Multi-way array aggregation allows you to perform these computations efficiently by leveraging the inherent structure of the data.
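
When the cube is small enough to hold as a dense array, the essence of the idea can be seen with NumPy: each aggregate is a sum of the base array along one or more axes. This is only a conceptual sketch; the actual multi-way algorithm works chunk by chunk and shares partial sums, which the snippet does not show.

# Conceptual sketch: aggregating a dense 3D array (time x location x product).
import numpy as np

cube = np.arange(2 * 3 * 4).reshape(2, 3, 4)   # base cuboid with illustrative values

by_time_location = cube.sum(axis=2)       # aggregate away the product dimension
by_time          = cube.sum(axis=(1, 2))  # aggregate away location and product
grand_total      = cube.sum()             # apex cuboid: everything aggregated

print(by_time_location.shape, by_time.shape, grand_total)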

•Bottom-Up Computation (BUC) Algorithm: BUC is a method for computing multi-dimensional aggregates in a bottom-up manner. It is particularly useful for OLAP (Online Analytical Processing) tasks where you need to compute aggregations on multi-dimensional data cubes efficiently. The algorithm works by recursively partitioning the dataset into smaller sub-cubes and computing aggregates for each sub-cube. These aggregates are then combined to obtain aggregates for larger cubes, eventually leading to the computation of aggregates for the entire dataset.
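
A very small Python sketch of the bottom-up idea follows (the dimension names and rows are invented, the only aggregate is a count, and min_support plays the role of iceberg pruning):

# Minimal BUC-style sketch: recursively partition the rows on one dimension at a
# time, record the count for each group-by, and prune partitions that are too small.
from collections import defaultdict

def buc(rows, dims, prefix=(), min_support=1, results=None):
    if results is None:
        results = {}
    results[prefix] = len(rows)                  # aggregate for the current group-by
    for i, dim in enumerate(dims):
        partitions = defaultdict(list)
        for row in rows:
            partitions[row[dim]].append(row)
        for value, part in partitions.items():
            if len(part) >= min_support:         # iceberg pruning
                buc(part, dims[i + 1:], prefix + ((dim, value),), min_support, results)
    return results

rows = [{"city": "Delhi", "item": "phone"},
        {"city": "Delhi", "item": "laptop"},
        {"city": "Mumbai", "item": "phone"}]
for group_by, count in buc(rows, ["city", "item"]).items():
    print(group_by, count)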
Star Cubing/Schema
Star-Cubing is a data mining technique used for OLAP (Online Analytical Processing) analysis. It's particularly efficient for
generating summaries or aggregations of data in a data warehouse or a multidimensional database.

The term "Star" in Star-Cubing refers to the star schema, a common design for data warehouses where a fact table (containing
the primary data) is surrounded by dimension tables (containing the context or descriptors for the data). Cubing refers to
creating multi-dimensional cubes for analysis.

Here's how Star-Cubing typically works:

Selection of Cube Dimensions: The dimensions along which the data needs to be analyzed are selected. These dimensions can
be time, geography, product category, etc.

Preprocessing: The data is preprocessed to create a cube. This involves grouping data based on combinations of dimensions
and precomputing aggregated values for each group.

Cube Construction: The cube is constructed based on the dimensions selected. It's essentially a multi-dimensional array where
each cell represents the intersection of values from different dimensions, and the value in the cell is the aggregated result (like
sum, count, average, etc.) of the data points falling into that intersection.

Cube Pruning: This step involves reducing the size of the cube by removing redundant or less significant data points. This is
crucial for improving query performance and reducing memory requirements.
Cube Storage and Indexing: The cube is stored in a data structure optimized for fast retrieval, often with indexing
mechanisms to quickly access relevant data points.

Querying: Once the cube is constructed and stored, users can query it to obtain summarized views of the data along different
dimensions. These queries are typically much faster than querying the original raw data, as the cube contains precomputed
aggregations.

Star-Cubing is especially useful for OLAP operations because it allows for quick analysis of large volumes of data across
multiple dimensions. It is commonly used in business intelligence applications for tasks like sales analysis, customer
segmentation, and trend identification.
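
As a rough illustration of querying a star schema (table contents invented), the fact table is joined to its dimension tables and then aggregated; this is the simple join logic referred to in the advantages below.

# Star-schema style query sketch: join the fact table to its dimensions, then aggregate.
import pandas as pd

dim_item = pd.DataFrame({"item_key": [1, 2],
                         "item_name": ["phone", "laptop"],
                         "brand": ["A", "B"]})
dim_time = pd.DataFrame({"time_key": [1, 2],
                         "quarter": ["Q1", "Q2"],
                         "year": [2024, 2024]})
fact_sales = pd.DataFrame({"item_key": [1, 1, 2, 2],
                           "time_key": [1, 2, 1, 2],
                           "rupees_sold": [605, 680, 825, 952]})

result = (fact_sales
          .merge(dim_item, on="item_key")
          .merge(dim_time, on="time_key")
          .groupby(["quarter", "item_name"])["rupees_sold"].sum())
print(result)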
•Advantages of Star Schema :
•Simpler Queries –
The join logic of a star schema is quite simple in comparison to the join logic needed to fetch data from a highly normalized
transactional schema.
•Simplified Business Reporting Logic –
In comparison to a highly normalized transactional schema, the star schema simplifies common business reporting logic, such as
as-of reporting and period-over-period comparisons.
•Feeding Cubes –
The star schema is widely used by OLAP systems to design OLAP cubes efficiently. In fact, major OLAP systems deliver a ROLAP
mode of operation which can use a star schema as a source without designing a cube structure.
•Disadvantages of Star Schema –
•Data integrity is not enforced well, since the schema is in a highly de-normalized state.
•Not as flexible in terms of analytical needs as a normalized data model.
•Star schemas don't support many-to-many relationships between business entities, at least not frequently.
•What is OLAP?
•OLAP stands for Online Analytical Processing, which is a technology that enables multi-dimensional analysis of business data. It provides interactive access to large amounts of data and supports complex calculations and data aggregation. OLAP is used to support business intelligence and decision-making processes.

•A grouping of data in a multidimensional matrix is called a data cube. In data warehousing, we generally deal with multidimensional data models, as the data is represented by multiple dimensions and multiple attributes. This multidimensional data is represented in the data cube, as the cube represents a high-dimensional space. The data cube pictorially shows how different attributes of data are arranged in the data model. Below is the diagram of a general data cube.
Introduction to OLAP and High-Dimensional OLAP Methods

High-Dimensional OLAP (HDOLAP) extends the capabilities of traditional OLAP by addressing the challenges posed by datasets
with a large number of dimensions. In traditional OLAP, querying and analyzing data become increasingly complex and
computationally intensive as the number of dimensions grows. HDOLAP methods are specifically designed to handle datasets
with high dimensionality efficiently.

Here's a brief introduction to both OLAP and HDOLAP:


OLAP:
OLAP systems organize data into multidimensional structures called cubes. These cubes typically consist of dimensions
(attributes or categories along which the data is analyzed) and measures (numeric data that is analyzed).
Users can navigate through different levels of aggregation, drill down into details, pivot dimensions, and slice and dice data to
analyze it from various perspectives.
OLAP operations include roll-up (aggregating data from lower levels to higher levels), drill-down (breaking down aggregated
data into more detailed levels), slice (selecting a subset of data based on one or more dimensions), and dice (selecting a subset of
data based on multiple dimensions).
OLAP systems are characterized by their ability to provide fast query response times, even with large datasets and complex
queries.
High-Dimensional OLAP (HDOLAP):
HDOLAP methods address the challenges posed by datasets with a large number of dimensions, which can lead to the "curse
of dimensionality" in traditional OLAP systems.
Traditional OLAP systems may struggle to efficiently handle high-dimensional data due to the exponential increase in the size
of the data cube as the number of dimensions grows.
HDOLAP methods employ techniques such as dimension reduction, sparsity exploitation, and specialized indexing structures
to optimize storage and query processing for high-dimensional datasets.
These methods aim to preserve the analytical capabilities of traditional OLAP while addressing the scalability and
performance issues associated with high-dimensional data.
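
One simple way to exploit sparsity in a high-dimensional cube is to store only the non-empty cells, keyed by their coordinate tuples, instead of allocating a dense array; the sketch below (with invented dimensions and values) shows the idea.

# Sparse-cube sketch: keep only populated cells, keyed by their coordinates.
from collections import defaultdict

cells = defaultdict(float)   # (time, location, product, channel) -> measure

def add_fact(coords, value):
    cells[coords] += value

add_fact(("Q1", "Delhi", "phone", "online"), 605.0)
add_fact(("Q1", "Mumbai", "laptop", "store"), 825.0)

# A dense array over these four dimensions would be mostly zeros; the dictionary
# grows only with the number of populated cells.
print(len(cells), cells[("Q1", "Delhi", "phone", "online")])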
Further Development of Data Cube and OLAP Methods

Further development in the field of data cube and OLAP methods involves advancements aimed at enhancing their capabilities,
scalability, and performance to meet the evolving needs of data analysis and decision support. Here are some areas of further
development:

Advanced Query Optimization: Improving query processing and optimization techniques to handle complex analytical queries
more efficiently, particularly in scenarios involving large datasets and high-dimensional data cubes. This may include the
development of innovative algorithms for query rewriting, indexing structures, and caching mechanisms.

Real-Time OLAP: Integrating real-time data processing capabilities into OLAP systems to support analysis of streaming data and
enable timely decision-making. This involves developing techniques for ingesting, processing, and analysing data streams in near
real-time, while still providing interactive and ad-hoc query capabilities.

In-Memory OLAP: Expanding the use of in-memory computing technologies to accelerate data cube processing and analysis. In-
memory OLAP systems store data entirely in RAM, reducing disk I/O latency and enabling faster query response times. Further
development in this area involves optimizing memory utilization, data compression techniques, and memory management
strategies to handle large datasets.

Parallel and Distributed OLAP: Leveraging parallel and distributed computing architectures to scale OLAP systems horizontally
and handle massive volumes of data. This includes developing parallel query processing algorithms, distributed storage
mechanisms, and coordination protocols for distributed OLAP processing across multiple nodes or clusters.
Advanced Analytics Integration: Integrating advanced analytics and machine learning techniques directly into OLAP systems
to enable more sophisticated analysis and predictive modeling. This involves providing built-in support for statistical analysis,
data mining algorithms, predictive modeling, and anomaly detection within OLAP environments.

Semantic OLAP: Enhancing OLAP systems with semantic capabilities to enable more intuitive and context-aware data analysis.
Semantic OLAP involves incorporating semantic metadata, ontologies, and knowledge graphs into OLAP models to enrich data
semantics, support natural language querying, and enable more intelligent query understanding and interpretation.

User Interface and Visualization Enhancements: Improving the user interface and visualization capabilities of OLAP tools to
enhance user experience and facilitate exploratory data analysis. This includes developing interactive and intuitive visualization
techniques, support for dynamic dashboards, and integration with modern BI and data visualization platforms.

Cloud-Based OLAP Solutions: Developing cloud-native OLAP solutions that leverage the scalability, elasticity, and cost-
effectiveness of cloud computing platforms. Cloud-based OLAP offerings enable organizations to deploy and manage OLAP
systems as scalable services, with features such as auto-scaling, pay-as-you-go pricing models, and seamless integration with
other cloud services.
How Data Cubes and OLAP are related to each other

A data cube and OLAP (Online Analytical Processing) are closely related concepts in the realm of data analysis and business
intelligence.

Data Cube:

A data cube is a multi-dimensional representation of data used to analyze and visualize complex datasets. It organizes data into
a structure with multiple dimensions, such as time, geography, product, and so on.
Each dimension represents a different attribute or perspective of the data, and the intersections of these dimensions form cells
in the cube, each containing aggregated or summarized data.
Data cubes allow for efficient querying, slicing, dicing, and drilling down into the data to gain insights from various angles.
OLAP (Online Analytical Processing):

OLAP refers to a category of technologies and tools used for analyzing and manipulating multidimensional data from multiple
perspectives.
OLAP enables users to perform complex analyses interactively, including slicing and dicing data, drilling down into details,
rolling up summary data, and pivoting dimensions.
OLAP systems typically operate on data cubes or similar multidimensional data structures, providing a fast and flexible
environment for decision support and business intelligence.
Relationship:
Data cubes are often used as the underlying data structure in OLAP systems. OLAP systems leverage the multidimensional
nature of data cubes to provide users with interactive and intuitive interfaces for analyzing data.
In essence, data cubes provide the structure and organization necessary for OLAP operations. OLAP tools interact with data
cubes to facilitate analytical tasks, such as ad-hoc querying, reporting, and data visualization, making it easier for users to
explore and understand complex datasets.

In summary, data cubes provide the multidimensional data model, while OLAP systems leverage this model to provide
analytical capabilities for interactive data analysis. Together, they enable users to analyze large volumes of data from different
perspectives and gain valuable insights for decision-making.
•Attribute-Oriented Induction

•Attribute-Oriented Induction (AOI) is a machine learning technique used for knowledge discovery and data mining. It is particularly useful for discovering patterns and relationships within
datasets characterized by attributes or features.

•Here's how Attribute-Oriented Induction works:

•Attribute Selection: AOI begins by selecting a target attribute or set of attributes that represent the outcome or class of interest. These attributes are typically
categorical or discrete in nature.

•Attribute Construction: Next, AOI constructs attribute concepts or patterns based on the selected target attributes. These attribute concepts represent
patterns of attributes that are associated with specific classes or outcomes.

•Rule Generation: AOI generates rules or patterns that describe relationships between attribute concepts and the target classes. These rules are often in the
form of "if-then" statements, where the antecedent specifies conditions on attribute values and the consequent specifies the predicted class or outcome.

•Rule Evaluation: The generated rules are evaluated based on various criteria such as accuracy, coverage, and simplicity. Rules that meet predefined criteria
are retained as potential knowledge discovered from the dataset.

•Rule Refinement: AOI may involve iterative refinement of rules to improve their quality and generalization. This may include pruning rules to eliminate
redundancy or overfitting, as well as refining conditions to increase their predictive power.

•Rule Application: Finally, the discovered rules can be applied to new instances or datasets to predict their class labels or outcomes based on their attribute
values.

Attribute-Oriented Induction is commonly used in areas such as data mining, knowledge discovery, and decision support systems. It's particularly effective for discovering
interpretable patterns and rules from datasets with categorical or discrete attributes, making it valuable for tasks such as classification, pattern recognition, and rule-based
reasoning.
Attribute-Oriented Induction
The Attribute-Oriented Induction (AOI) approach to data generalization and summarization-based
characterization was first proposed in 1989 (KDD '89 workshop), a few years before the
introduction of the data cube approach.

The data cube approach can be considered a data warehouse-based, precomputation-oriented,
materialized approach.

It performs off-line aggregation before an OLAP or data mining query is submitted for processing.

On the other hand, the attribute-oriented induction approach, at least in its initial proposal, is a
relational database query-oriented, generalization-based, on-line data analysis technique.

However, there is no inherent barrier distinguishing the two approaches based on online
aggregation versus offline precomputation.
Some aggregations in the data cube can be computed on-line, while off-line precomputation of
multidimensional space can speed up attribute-oriented induction as well.

It was proposed in 1989 (KDD '89 workshop).

It is not confined to categorical data, nor to particular measures.


How is it done?
Collect the task-relevant data (the initial relation) using a relational database query.
Perform generalization by attribute removal or attribute generalization.

Apply aggregation by merging identical, generalized tuples and accumulating their respective counts.
This reduces the size of the generalized data set.
Present the result interactively to users.

Basic Principles of Attribute-Oriented Induction

Data focusing:
Analyze task-relevant data, including dimensions; the result is the initial relation.

Attribute removal:
Remove attribute A if there is a large set of distinct values for A but (1) there is no generalization operator on A, or (2) A's
higher-level concepts are expressed in terms of other attributes.

Attribute generalization:
If there is a large set of distinct values for A, and there exists a set of generalization operators on A, then select an operator
and generalize A.
Attribute-threshold control:
The number of distinct values allowed per generalized attribute; typically 2-8, specified by the user or set by default.

Generalized relation threshold control (typically 10-30):
Controls the final relation/rule size.
Algorithm for Attribute-Oriented Induction
InitialRel:
Query processing of the task-relevant data, deriving the initial relation.

PreGen:
Based on the analysis of the number of distinct values in each attribute, determine the generalization plan for each attribute:
removal, or how high to generalize.

PrimeGen:
Based on the PreGen plan, perform the generalization to the right level to derive the "prime generalized relation",
accumulating the counts.

Presentation:
User interaction: (1) adjust levels by drilling, (2) pivoting, (3) mapping into rules, cross tabs, or visualization presentations.

Example
Let's say there is a University database that is to be characterized. Its corresponding DMQL query will be:

use University_DB
mine characteristics as “Science_Students”
in relevance to name, gender, major, birth_place, birth_date, residence, phone_no, GPA
from student
Its corresponding SQL statement can be:

Select name, gender, major, birth_place, birth_date, residence, phone_no, GPA
from student
where status in {"Msc", "MBA", "Ph.D."}

Now for this database let's create a characterized view:

InitialRel:
From this table, we query the task-relevant data. We also remove a few attributes, such as name and phone_no, because they make
no sense for drawing insights.

PreGen:
Now we generalize these results by removing a few attributes and retaining the important ones.

We also generalize a few attributes, renaming them "Country" rather than "Birth_Place", "Age Range" rather than "Birth_date",
"City" rather than "Residence", and so on, as per the table given below.
